Apache Pig.
-- Jigar Parekh.
472062.
What is Pig:
• Pig is an open-source, high-level dataflow
system.
• It provides a simple language for queries and
data manipulation called Pig Latin.
• Internally, Pig Latin scripts are compiled into
Map-Reduce jobs that run on Hadoop.
• As with an SQL query, the user specifies
the “What” and leaves the “How” to the
underlying processing engine.
Pig in Hadoop Eco System:
Pig sits on top of the Map-Reduce layer.
Pig v/s Map-Reduce:
Map-Reduce | Pig
MR jobs are written and compiled in Java. | Pig Latin is a scripting language.
Java knowledge is needed. | Java knowledge is not required, except perhaps to write your own UDFs.
Lots of hand coding. | Pig provides predefined SQL-like functions and lets you extend existing UDFs.
Users are much more comfortable with MR when dealing with totally unstructured data. | Pig has problems dealing with unstructured data like images, videos, etc.
Who is using Pig:
• 70% of production jobs at Yahoo (tens of
thousands per day)
• Yahoo, Twitter, LinkedIn, eBay, AOL, ...
• Used to
– Process web logs
– Build user behavior models
– Build maps of the web
– Do research on large data sets
Accessing Pig:
• There are two modes in which we can run
Pig:
1) Local Mode: To run Pig in local mode, you
only need access to a single machine.
2) Hadoop (Map-Reduce) Mode: To run Pig in
Hadoop (Map-Reduce) mode, you need access to
a Hadoop cluster and an HDFS installation.
Running Ways:
• Grunt Shell: Enter Pig commands manually using Pig’s interactive shell,
Grunt.
e.g: $ pig -x <local or mapreduce>
grunt>
• Script File: Place Pig commands in a script file and run the script.
e.g: $ pig -x <local or mapreduce> my_script.pig
• Embedded Program: Embed Pig commands in a host language and run
the program.
e.g: $ java -cp pig.jar:. idlocal
$ java -cp pig.jar:.:$HADOOPDIR idhadoop
Note: the '-x mapreduce' flag is optional if we want to run in Hadoop
mode. Example: '$ pig -x mapreduce' is the same as '$ pig', and
'$ pig -x mapreduce my_script.pig' is the same as '$ pig my_script.pig'.
Data Types:
Simple Types | Description | Example
int | Signed 32-bit integer | 10
long | Signed 64-bit integer | Data: 10L or 10l; Display: 10L
float | 32-bit floating point | Data: 10.5F, 10.5f, 10.5e2f or 10.5E2F; Display: 10.5F or 1050.0F
double | 64-bit floating point | Data: 10.5, 10.5e2 or 10.5E2; Display: 10.5 or 1050.0
chararray | Character array (string) in Unicode UTF-8 format | hello world
bytearray | Byte array (blob) |
boolean | Boolean | true/false (case insensitive)
Complex Types
tuple | An ordered set of fields. | (19,2)
bag | A collection of tuples. | {(19,2),(18,1)}
map | A set of key-value pairs. | [name#John,phone#5551212]
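As an illustration of the complex types, a small sketch of loading and accessing a map field (the file name 'students' and its contents are hypothetical, not from the slides):

```pig
-- Assume a tab-separated file 'students' whose second field is a map, e.g.:
--   John    [city#NYC,phone#5551212]
A = LOAD 'students' AS (name:chararray, info:map[]);
-- '#' looks up a key in a map; a missing key yields null
B = FOREACH A GENERATE name, info#'city' AS city;
DUMP B;
```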
Pig Execution:
• Pig scripts/commands follow the pattern
given below:
1) Load (Text, CSV, JSON, Hive table)
2) Transform (Filter, Group, Sort)
3) Store (Dump, Store into HDFS, Hive)
Loading Data in Pig:
• A = LOAD 'student';
• file_load = LOAD '/usr/tmp/student.txt';
• Z = LOAD 'student' USING PigStorage() AS (name : chararray, age : int, gpa : float);
• A = LOAD 'data' AS (f1 : int, f2 : int, B: bag {T : tuple (t1 : int, t2 : int)});
-- A, file_load and Z here are called Relations.
-- LOAD is the keyword used for loading data from HDFS into a Relation for processing / transformation.
-- 'student' is the name of the file or directory, in single quotes. We can give a full path name, or
file_name* to load all files with similar names.
-- USING is a keyword.
-- PigStorage() / TextLoader() / JsonLoader() / HCatLoader(): we need to use the appropriate function
for Pig to understand the incoming data. These names are case-sensitive.
PigStorage() defaults to TAB-separated data. If the separator is different, we need to specify it between the
parentheses, for example: PigStorage(','), PigStorage('\t')
-- AS is a keyword.
-- (name : chararray, ...) is called the Schema.
Accessing the Relation
• Once the data is loaded into a Relation, there are two ways we can
access the data.
(1) Positional
(2) Schema names.
In the first example, the columns need to be accessed by position,
as no schema is defined. The notation starts with $0 for the first
column, $1 for the second column, and so forth.
In the next example, the schema is defined in terms of column names.
We can use either the $0, $1 notation or the column names as
is.
grunt> DESCRIBE A;
-- Does not produce any output since A is schema-less.
grunt> DESCRIBE Z;
Z: {name : chararray, age : int, gpa : float}
Data Transformation in Pig:
• Arithmetic Operators. [+, -, *, /, %, ? :]
• Relational Operators. [filter, group, order,
distinct, load, store, etc]
• Diagnostic Operators. [dump, describe, etc]
• Eval Functions. [count, max, min, concat, etc]
• Math Functions. [round, abs, floor, ceil, etc]
• String Functions. [lower, upper, substring,
trim, etc]
Relational Operators in Pig:
• The ones which are important to know are:
1) FILTER,
2) GROUP BY / COGROUP,
3) LIMIT,
4) ORDER BY,
5) JOIN,
6) DISTINCT,
7) FOREACH GENERATE.
Relational Operator Examples:
-- Filter : It is similar to WHERE clause in SQL.
grunt> A = LOAD 'data' AS (f1 : int, f2 : int, f3 : int);
grunt> X = FILTER A BY f3 == 3;
grunt> Y = FILTER A BY (f1 == 8) OR (f2 == 10);
-- Group By / CoGroup : GROUP is used for grouping the data in a single relation.
When we need to group two or more relations, we use COGROUP.
grunt> A = load 'student' AS (name: chararray, age: int, gpa: float);
grunt> B = GROUP A BY age;
grunt> A = LOAD 'data1' AS (owner : chararray, pet : chararray);
grunt> DUMP A;
(Alice,turtle)
(Alice,goldfish)
(Alice,cat)
(Bob,dog)
(Bob,cat)
grunt> B = LOAD 'data2' AS (friend1 : chararray, friend2 : chararray);
grunt> DUMP B;
(Cindy,Alice)
(Mark,Alice)
(Paul,Bob)
(Paul,Jane)
grunt> X = COGROUP A BY owner, B BY friend2;
Output:
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
(Jane,{},{(Paul,Jane)})
-- Join : Essentially GROUP and JOIN operators perform similar functions. GROUP
creates a nested set of output tuples while JOIN creates a flat set of output tuples.
• Types of Joins : Inner, Outer (Left, Right, Full), Replicated, Merge, Skewed.
• Examples:
grunt> A = LOAD 'data1' AS (a1:int,a2:int,a3:int); grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> B = LOAD 'data2' AS (b1:int,b2:int);
grunt> DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
grunt> X = JOIN A BY a1, B BY b1;
grunt> DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9) …
• FOREACH .. GENERATE: Generates data transformations based on
columns of data.
• Generally it follows the Join, Group, Filter, or Load operators
when you want to work with only a few selected columns.
• Example:
grunt> A = LOAD 'data' AS (f1:int,f2:int,f3:int);
grunt> DUMP A;
grunt> Y = FOREACH A GENERATE *; -- this will print the Relation A as is with all cols.
grunt> B = GROUP A BY f1;
grunt> DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
grunt> X = FOREACH B GENERATE group, COUNT(A) AS total;
(1,1)
(4,2)
(7,1)
(8,2)
Here 'group' is the first column of the grouped output and is named
implicitly by Pig. It points to the values 1, 4, 7 and 8.
-- Limit: Limits the number of output tuples. If the
specified number of output tuples is equal to or
exceeds the number of tuples in the relation, all
tuples in the relation are returned.
Example: grunt> X = LIMIT A 3;
grunt> DUMP X;
(1,2,3)
(4,3,3)
(7,2,5)
Note: For Top N analysis, use ORDER BY (asc or desc)
and then Limit the output.
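The Top-N pattern from the note can be sketched as follows (using relation A with fields a1, a2, a3 from the JOIN example above; the alias names are illustrative):

```pig
-- Top 3 tuples by the third field: ORDER BY descending first, then LIMIT
sorted = ORDER A BY a3 DESC;
top3   = LIMIT sorted 3;
DUMP top3;  -- (7,2,5), (8,3,4), then one of the tuples with a3 == 3
```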
-- Distinct : Removes duplicate tuples in a relation.
grunt> A = LOAD 'data' AS (a1:int,a2:int,a3:int);
grunt> DUMP A;
(8,3,4)
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)
grunt> X = DISTINCT A;
grunt> DUMP X;
(1,2,3)
(4,3,3)
(8,3,4)
-- Order By : Sorts a relation based on one or more fields. ORDER BY is NOT stable; if multiple
records have the same ORDER BY key, the order in which these records are returned is not
defined and is not guaranteed to be the same from one run to the next.
grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> X = ORDER A BY a3 DESC;
grunt> DUMP X;
(7,2,5)
(8,3,4)
(1,2,3)
(4,3,3)
(8,4,3)
(4,2,1)
Arithmetic Operators in Pig:
• We have the standard arithmetic operators which
Pig uses. They are:
1) Addition (+)
2) Subtraction (-)
3) Multiplication (*)
4) Division (/)
5) Modulo (%)
6) Bincond (? :) [(condition ? value_if_true : value_if_false)]
7) Case (CASE WHEN THEN ELSE END)
• Examples:
grunt> X = FOREACH A GENERATE f1, f2, f1+f2 AS f4;
grunt> X = FOREACH A GENERATE f2, (f2==1 ? 1: f3);
grunt> X = FOREACH A GENERATE f2,
( CASE WHEN f2 % 2 == 0 THEN 'even'
WHEN f2 % 2 == 1 THEN 'odd'
END );
• The above CASE statement can be written as :
grunt> X = FOREACH A GENERATE f2,
( CASE f2 % 2 WHEN 0 THEN 'even'
WHEN 1 THEN 'odd'
END );
Math Functions in Pig:
• Abs : Returns the absolute value of an expression.
Example: ABS(int a), ABS(float b)
• Ceil : Returns the value of the expression rounded up to
the nearest integer.
Example: CEIL(4.6), CEIL(1.0), CEIL(-2.4)
• Floor : Returns the value of the expression rounded down
to the nearest integer.
Example: FLOOR(4.6), FLOOR(1.0), FLOOR(-2.4)
• Round : Returns the value of an expression rounded to an
integer.
Example: ROUND(4.6), ROUND(1.0), ROUND(-2.4)
• SQRT : Returns the positive square root of an expression.
Example: SQRT(5)
String Functions in Pig:
• Lower / Upper : Converts all characters in a string to lower /
upper case.
• LTRIM / RTRIM / TRIM : Returns a copy of a string with
leading / trailing / or both, white space removed.
• SUBSTRING : Returns a substring from a given string.
Syntax : SUBSTRING(string, startIndex, stopIndex)
Example : SUBSTRING('ABCDEF',1,4) => 'BCD'. The start index
is 0-based, and the stop index is exclusive: it should be one
past the last character we want.
• REPLACE : Replaces existing characters in a string with new
characters.
Syntax : REPLACE(string, 'oldChar', 'newChar');
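A quick sketch combining these string functions in a single FOREACH (the file name 'names' and the field/alias names are hypothetical):

```pig
A = LOAD 'names' AS (name:chararray);
B = FOREACH A GENERATE
      UPPER(name)             AS shouting,
      TRIM(name)              AS trimmed,
      SUBSTRING(name, 0, 3)   AS first3,
      REPLACE(name, 'a', 'o') AS swapped;
DUMP B;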
Eval Functions in Pig:
• The Eval functions usually operate on the 'bag' datatype, so we
need to GROUP BY before applying them.
• Count / Count_Star : Computes the number of elements in a bag. The
COUNT function ignores nulls. If you want to include NULL values in the
count computation, use COUNT_STAR. The output datatype will always be
of type Long.
Example : DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
X = FOREACH B GENERATE COUNT(A);
DUMP X;
(1L)
(2L)
(1L)
(2L)
• Min / Max : Computes the minimum / maximum of the numeric values or chararrays in a single-column bag. In
the below example the single-column is GPA.
Example :
A = LOAD 'student' AS (name:chararray, session:chararray, gpa:float);
DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
B = GROUP A BY name;
DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
X = FOREACH B GENERATE group, MAX(A.gpa);
DUMP X;
(John,4.0F)
(Mary,4.0F)
C = FOREACH B GENERATE A.name, AVG(A.gpa);
DUMP C;
({(John),(John),(John),(John)},3.850000023841858)
({(Mary),(Mary),(Mary),(Mary)},3.925000011920929)
Storing Data from Pig :
• Store functions determine how the data comes out of Pig.
• PigStorage() :
1. Stores data in UTF-8 format.
2. PigStorage is the default function for the STORE operator
and works with both simple and complex data types.
3. PigStorage supports structured text files (in human-
readable UTF-8 format).
4. The default field delimiter is tab ('\t'). You can also specify
other characters as delimiters, within single quotes.
Example : STORE X INTO 'output' USING PigStorage('*');
• HCatStorer() :
1. HCatStorer is used in Pig scripts to write data to HCatalog-
managed tables (read: Hive).
2. To bring in the appropriate jars for working with HCatalog, simply
include the following flag when running Pig from
the shell:
pig -useHCatalog
3. The fully qualified package name is:
org.apache.hive.hcatalog.pig.HCatStorer
Example :
STORE processed_data INTO 'tablename' USING
org.apache.hive.hcatalog.pig.HCatStorer();
A = LOAD 'tablename' USING
org.apache.hive.hcatalog.pig.HCatLoader();
Link :
https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadSto
re
User Defined Functions ( UDF ) :
• If a requirement cannot be met by the already
existing operators / functions, the user has the
option to write their own.
• Pig provides extensive support for user defined functions
(UDFs) as a way to specify custom processing.
• Pig UDFs can currently be implemented in three languages:
Java, Python, and JavaScript.
• You can customize all parts of the processing including data
load/store, column transformation, and aggregation.
• Pig also provides support for Piggy Bank, a repository for JAVA
UDFs. Through Piggy Bank you can access Java UDFs written
by other users and also contribute your java UDFs that you
have written.
• Please explore the Piggy Bank option before writing your own
function, as someone may already have written it.
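As a minimal sketch of what a Python UDF looks like (the function name `normalize` and the registration aliases below are illustrative, not from the slides):

```python
# A Python (Jython) UDF sketch for Pig. In a real UDF file you would add
#   @outputSchema("name:chararray")
# above the function so Pig knows the return schema, and then in Pig Latin:
#   REGISTER 'myudfs.py' USING jython AS myudfs;
#   B = FOREACH A GENERATE myudfs.normalize(name);

def normalize(s):
    """Trim whitespace and lower-case a chararray field."""
    if s is None:        # Pig passes null fields as None
        return None
    return s.strip().lower()

print(normalize("  Hello PIG  "))  # prints: hello pig
```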
Pig Example:
Word Count in Pig:
lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
(a,2)
(is,2)
(This,1)
(class,1)
(hadoop,2)
(bigdata,1)
(technology,1)
TOKENIZE:
({(This),(is),(a),(hadoop),(class)})
({(hadoop),(is),(a),(bigdata),(technology)})
Flatten :
(This)
(is)
(a)
(hadoop)
(class) ….
Summary :
• Pig is an open-source high-level language.
• It sits above Map Reduce to simplify coding.
• Three main blocks of processing data :
– Load
– Transform
– Store.
• Pig can Load and Store from different sources
like DFS, Hive, etc.
• User can write UDFs to extend the functionality.
References :
• Pig Manual :
https://pig.apache.org/docs/r0.7.0/index.html
• Books :
– Programming Pig by O'Reilly
Thank You!
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Enhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZEnhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZ
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 

Apache Pig

  • 8. Data Types:
Simple Types:
– int : Signed 32-bit integer. Example: 10
– long : Signed 64-bit integer. Data: 10L or 10l; Display: 10L
– float : 32-bit floating point. Data: 10.5F, 10.5f, 10.5e2f or 10.5E2F; Display: 10.5F or 1050.0F
– double : 64-bit floating point. Data: 10.5, 10.5e2 or 10.5E2; Display: 10.5 or 1050.0
– chararray : Character array (string) in Unicode UTF-8 format. Example: hello world
– bytearray : Byte array (blob).
– boolean : true/false (case insensitive).
Complex Types:
– tuple : An ordered set of fields. Example: (19,2)
– bag : A collection of tuples. Example: {(19,2),(18,1)}
– map : A set of key-value pairs. Example: [name#John,phone#5551212]
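As a minimal sketch, both the simple and complex types above can appear in a LOAD schema; the file and field names here are illustrative, not from the slides:

```pig
-- Simple types in a schema (hypothetical tab-separated file 'students'):
A = LOAD 'students' AS (id:int, visits:long, score:float, gpa:double,
                        name:chararray, active:boolean);

-- Complex types in a schema (hypothetical file 'data'):
B = LOAD 'data' AS (t:tuple(a:int, b:int),
                    bg:bag{row:tuple(x:int, y:int)},
                    m:map[]);
```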
  • 9. Pig Execution:
• Pig scripts/commands follow the pattern given below:
– Load (Text, CSV, JSON, Hive table)
– Transform (Filter, Group, Sort)
– Store (Dump, Store into HDFS, Hive)
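The three stages can be sketched as one small script; the paths, delimiter, and field names are assumptions for illustration:

```pig
-- Load
raw = LOAD '/user/hadoop/students.csv' USING PigStorage(',')
      AS (name:chararray, gpa:float);

-- Transform
passed = FILTER raw BY gpa >= 3.0;
ranked = ORDER passed BY gpa DESC;

-- Store
STORE ranked INTO '/user/hadoop/output/passed' USING PigStorage(',');
```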
  • 10. Loading Data in Pig:
• A = LOAD 'student';
• file_load = LOAD '/usr/tmp/student.txt';
• Z = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
• A = LOAD 'data' AS (f1:int, f2:int, B:bag{T:tuple(t1:int, t2:int)});
-- A / file_load / Z here are called Relations.
-- LOAD is the keyword used for loading data from HDFS into the Relation for processing / transformation.
-- 'student' is the name of the file or directory, in single quotes. We can give a full path name, or file_name* to load all files with similar names.
-- USING is a keyword that specifies the load function.
-- PigStorage() / TextLoader() / JsonLoader() / HCatLoader(): we need to use the appropriate function for Pig to understand the incoming data. These are case sensitive. PigStorage() defaults to TAB-separated data; if the separator is different, we need to specify it between the parentheses, for example: PigStorage(','), PigStorage('\t').
-- AS is a keyword.
-- (name:chararray, ...) is called the Schema.
  • 11. Accessing the Relation:
• Once the data is loaded into a Relation, there are two ways we can access the data: (1) by position, (2) by schema names.
In the first example, the columns need to be accessed by position, as no schema is defined. The notation starts with $0 for the first column, $1 for the second column, and so on. In the next example, the schema is defined in terms of column names, so we can use either the $0, $1 notation or the column names directly.
grunt> DESCRIBE A; -- Does not produce any output since A is schema-less.
grunt> DESCRIBE Z;
Z: {name : chararray, age : int, gpa : float}
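As a small sketch reusing the relations A and Z from the LOAD examples above, both access styles look like this:

```pig
A = LOAD 'student';                      -- no schema: positional access only
X = FOREACH A GENERATE $0, $2;

Z = LOAD 'student' USING PigStorage()
    AS (name:chararray, age:int, gpa:float);
Y1 = FOREACH Z GENERATE name, gpa;       -- by column name
Y2 = FOREACH Z GENERATE $0, $2;          -- or equivalently by position
```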
  • 12. Data Transformation in Pig:
• Arithmetic Operators. [+, -, *, /, %, ? :]
• Relational Operators. [FILTER, GROUP, ORDER, DISTINCT, LOAD, STORE, etc.]
• Diagnostic Operators. [DUMP, DESCRIBE, etc.]
• Eval Functions. [COUNT, MAX, MIN, CONCAT, etc.]
• Math Functions. [ROUND, ABS, FLOOR, CEIL, etc.]
• String Functions. [LOWER, UPPER, SUBSTRING, TRIM, etc.]
  • 13. Relational Operators in Pig:
• The ones which are important to know are:
1) FILTER, 2) GROUP BY / COGROUP BY, 3) LIMIT, 4) ORDER BY, 5) JOIN, 6) DISTINCT, 7) FOREACH ... GENERATE.
  • 14. Relational Operator Examples:
-- FILTER : Similar to the WHERE clause in SQL.
grunt> A = LOAD 'data' AS (f1:int, f2:int, f3:int);
grunt> X = FILTER A BY f3 == 3;
grunt> Y = FILTER A BY (f1 == 8) OR (f2 == 10);
-- GROUP BY / COGROUP BY : GROUP BY is used for grouping within a single relation, whereas COGROUP BY is used when we need to group two or more relations.
grunt> A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
grunt> B = GROUP A BY age;
grunt> A = LOAD 'data1' AS (owner:chararray, pet:chararray);
grunt> DUMP A;
(Alice,turtle)
(Alice,goldfish)
(Alice,cat)
(Bob,dog)
(Bob,cat)
grunt> B = LOAD 'data2' AS (friend1:chararray, friend2:chararray);
grunt> DUMP B;
(Cindy,Alice)
(Mark,Alice)
(Paul,Bob)
(Paul,Jane)
grunt> X = COGROUP A BY owner, B BY friend2;
Output:
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
(Jane,{},{(Paul,Jane)})
  • 15. -- JOIN : Essentially, the GROUP and JOIN operators perform similar functions; GROUP creates a nested set of output tuples, while JOIN creates a flat set of output tuples.
• Types of Joins : Inner, Outer (Left, Right, Full), Replicated, Merge, Skewed.
• Examples:
grunt> A = LOAD 'data1' AS (a1:int, a2:int, a3:int);
grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> B = LOAD 'data2' AS (b1:int, b2:int);
grunt> DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
grunt> X = JOIN A BY a1, B BY b1;
grunt> DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
…
  • 16. • FOREACH .. GENERATE : Generates data transformations based on columns of data.
• Generally it follows the JOIN, GROUP, or FILTER operators, or LOAD if you want to work with only a select few columns.
• Example:
grunt> A = LOAD 'data' AS (f1:int, f2:int, f3:int);
grunt> DUMP A;
grunt> Y = FOREACH A GENERATE *; -- this will print the Relation A as is, with all columns.
grunt> B = GROUP A BY f1;
grunt> DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
grunt> X = FOREACH B GENERATE group, COUNT(A) AS total;
grunt> DUMP X;
(1,1)
(4,2)
(7,1)
(8,2)
Here 'group' is the first column of the grouped output and is named implicitly by Pig. It points to the values 1, 4, 7 and 8.
  • 17. -- LIMIT : Limits the number of output tuples. If the specified number of output tuples is equal to or exceeds the number of tuples in the relation, all tuples in the relation are returned.
Example:
grunt> X = LIMIT A 3;
grunt> DUMP X;
(1,2,3)
(4,3,3)
(7,2,5)
Note: For Top N analysis, use ORDER BY (asc or desc) and then LIMIT the output.
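The Top N pattern from the note can be sketched like this, assuming the relation A with fields a1, a2, a3 used in the ORDER BY example on the next slide:

```pig
-- Top 2 tuples by a3, descending
sorted = ORDER A BY a3 DESC;
top2   = LIMIT sorted 2;
DUMP top2;
-- (7,2,5)
-- (8,3,4)
```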
  • 18. -- DISTINCT : Removes duplicate tuples in a relation.
grunt> A = LOAD 'data' AS (a1:int, a2:int, a3:int);
grunt> DUMP A;
(8,3,4)
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)
grunt> X = DISTINCT A;
grunt> DUMP X;
(1,2,3)
(4,3,3)
(8,3,4)
-- ORDER BY : Sorts a relation based on one or more fields. ORDER BY is NOT stable; if multiple records have the same ORDER BY key, the order in which these records are returned is not defined and is not guaranteed to be the same from one run to the next.
grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> X = ORDER A BY a3 DESC;
grunt> DUMP X;
(7,2,5)
(8,3,4)
(1,2,3)
(4,3,3)
(8,4,3)
(4,2,1)
  • 19. Arithmetic Operators in Pig:
• Pig uses the standard arithmetic operators:
1) Addition (+)
2) Subtraction (-)
3) Multiplication (*)
4) Division (/)
5) Modulo (%)
6) Bincond (? :) [(condition ? value_if_true : value_if_false)]
7) Case (CASE WHEN THEN ELSE END)
  • 20. • Examples:
grunt> X = FOREACH A GENERATE f1, f2, f1 + f2 AS f4;
grunt> X = FOREACH A GENERATE f2, (f2 == 1 ? 1 : f3);
grunt> X = FOREACH A GENERATE f2, (
         CASE
           WHEN f2 % 2 == 0 THEN 'even'
           WHEN f2 % 2 == 1 THEN 'odd'
         END
       );
• The above CASE statement can also be written as:
grunt> X = FOREACH A GENERATE f2, (
         CASE f2 % 2
           WHEN 0 THEN 'even'
           WHEN 1 THEN 'odd'
         END
       );
  • 21. Math Functions in Pig:
• ABS : Returns the absolute value of an expression. Example: ABS(int a), ABS(float b)
• CEIL : Returns the value of the expression rounded up to the nearest integer. Example: CEIL(4.6), CEIL(1.0), CEIL(-2.4)
• FLOOR : Returns the value of the expression rounded down to the nearest integer. Example: FLOOR(4.6), FLOOR(1.0), FLOOR(-2.4)
• ROUND : Returns the value of an expression rounded to the nearest integer. Example: ROUND(4.6), ROUND(1.0), ROUND(-2.4)
• SQRT : Returns the positive square root of an expression. Example: SQRT(5)
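A short sketch applying these functions in one statement; the relation and field names are illustrative:

```pig
A = LOAD 'data' AS (f1:int, f2:double);
X = FOREACH A GENERATE ABS(f1), CEIL(f2), FLOOR(f2), ROUND(f2), SQRT(f2);
```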
  • 22. String Functions in Pig:
• LOWER / UPPER : Converts all characters in a string to lower / upper case.
• LTRIM / RTRIM / TRIM : Returns a copy of a string with leading, trailing, or both leading and trailing white space removed.
• SUBSTRING : Returns a substring from a given string.
Syntax : SUBSTRING(string, startIndex, stopIndex)
Example : SUBSTRING('ABCDEF', 1, 4) => BCD. The start index is 0-based, and the stop index should be one past the last character we want.
• REPLACE : Replaces existing characters in a string with new characters.
Syntax : REPLACE(string, 'oldChar', 'newChar');
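These can be combined in a single FOREACH; the relation and field names are illustrative:

```pig
A = LOAD 'data' AS (s:chararray);
X = FOREACH A GENERATE LOWER(s), UPPER(s), TRIM(s),
                       SUBSTRING(s, 0, 3), REPLACE(s, 'a', 'b');
```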
  • 23. Eval Functions in Pig:
• Usually the Eval functions operate on the 'bag' datatype, so we need to GROUP BY before applying the functions.
• COUNT / COUNT_STAR : Computes the number of elements in a bag. The COUNT function ignores nulls; if you want to include NULL values in the count computation, use COUNT_STAR. The output will always be of type long.
Example :
DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
X = FOREACH B GENERATE COUNT(A);
DUMP X;
(1L)
(2L)
(1L)
(2L)
  • 24. • MIN / MAX : Computes the minimum / maximum of the numeric values or chararrays in a single-column bag. In the example below, the single column is gpa.
Example :
A = LOAD 'student' AS (name:chararray, session:chararray, gpa:float);
DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
B = GROUP A BY name;
DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
X = FOREACH B GENERATE group, MAX(A.gpa);
DUMP X;
(John,4.0F)
(Mary,4.0F)
C = FOREACH B GENERATE A.name, AVG(A.gpa);
DUMP C;
({(John),(John),(John),(John)},3.850000023841858)
({(Mary),(Mary),(Mary),(Mary)},3.925000011920929)
  • 25. Storing Data from Pig :
• Store functions determine how the data comes out of Pig.
• PigStorage() :
1. Stores data in UTF-8 format.
2. PigStorage is the default function for the STORE operator and works with both simple and complex data types.
3. PigStorage supports structured text files (in human-readable UTF-8 format).
4. The default field delimiter is tab ('\t'). You can also specify other characters as delimiters, within single quotes.
Example : STORE X INTO 'output' USING PigStorage('*');
  • 26. • HCatStorer() :
1. HCatStorer is used with Pig scripts to write data to HCatalog-managed tables (read: Hive).
2. To bring in the appropriate jars for working with HCatalog, simply include the following flag when running Pig from the shell: pig -useHCatalog
3. The fully qualified package name is: org.apache.hive.hcatalog.pig.HCatStorer
Example :
STORE processed_data INTO 'tablename' USING org.apache.hive.hcatalog.pig.HCatStorer();
A = LOAD 'tablename' USING org.apache.hive.hcatalog.pig.HCatLoader();
Link : https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore
  • 27. User Defined Functions (UDF) :
• If a requirement cannot be fulfilled by the already existing operators / functions, the user has the option to write his own.
• Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing.
• Pig UDFs can currently be implemented in three languages: Java, Python, and JavaScript.
• You can customize all parts of the processing, including data load/store, column transformation, and aggregation.
• Pig also provides support for Piggy Bank, a repository for Java UDFs. Through Piggy Bank you can access Java UDFs written by other users and also contribute Java UDFs that you have written.
• Please explore the Piggy Bank option before writing your own function, as someone might already have coded it.
  • 28. Pig Example: Word Count in Pig:
lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
(a,2)
(is,2)
(This,1)
(class,1)
(hadoop,2)
(bigdata,1)
(technology,1)
After TOKENIZE:
({(This),(is),(a),(hadoop),(class)})
({(hadoop),(is),(a),(bigdata),(technology)})
After FLATTEN:
(This)
(is)
(a)
(hadoop)
(class)
…
  • 29. Summary :
• Pig is an open-source high-level language.
• It sits above Map-Reduce to simplify coding.
• Three main blocks of processing data:
– Load
– Transform
– Store
• Pig can Load and Store from different sources like DFS, Hive, etc.
• Users can write UDFs to extend the functionality.
  • 30. References :
• Pig Manual : https://pig.apache.org/docs/r0.7.0/index.html
• Books :
– Programming Pig by O'Reilly