What is Pig?
• Apache Pig is an abstraction over MapReduce.
• It is a tool/platform used to analyze large data sets by
representing them as data flows.
• Pig is generally used with Hadoop; we can perform all
the data manipulation operations in Hadoop using
Apache Pig.
• To write data analysis programs, Pig provides a high-level
language known as Pig Latin.
• This language provides various operators using which
programmers can develop their own functions for reading,
writing, and processing data.
Pig Architecture & Components
• To analyze data using Apache Pig, programmers need
to write scripts using Pig Latin language.
• All these scripts are internally converted to Map and
Reduce tasks.
• Apache Pig has a component known as Pig Engine that
accepts the Pig Latin scripts as input and converts those
scripts into MapReduce jobs.
Features of Pig
• Rich set of operators: It provides many operators to perform operations
like join, sort, filter, etc.
• Ease of programming: Pig Latin is similar to SQL and it is easy to write a
Pig script if you are good at SQL.
• UDFs: Pig provides the facility to create User Defined Functions in
other programming languages such as Java and to invoke or embed
them in Pig scripts.
• Handles all kinds of data: Apache Pig analyzes all kinds of data, both
structured as well as unstructured. It stores the results in HDFS.
Apache Pig Vs Hive
• Both Apache Pig and Hive are used to create MapReduce jobs. And in
some cases, Hive operates on HDFS in a similar way Apache Pig does.
Pig Latin – Data Model
Pig Execution Modes
• You can run Apache Pig in two modes.
• Local Mode
– In this mode, all the files are installed and run from your local
host and local file system. There is no need for Hadoop or
HDFS. This mode is generally used for testing purposes.
• MapReduce Mode
– MapReduce mode is where we load or process the data that exists
in the Hadoop File System (HDFS) using Apache Pig. In this mode,
whenever we execute the Pig Latin statements to process the data,
a MapReduce job is invoked in the back-end to perform a particular
operation on the data that exists in the HDFS.
Invoking the Grunt Shell
• Local Mode
$ pig -x local
• MapReduce Mode
$ pig -x mapreduce
(or)
$ pig
Execution Mechanisms
• Interactive Mode (Grunt shell) – You can run Apache Pig in
interactive mode using the Grunt shell. In this shell, you can
enter the Pig Latin statements and get the output (using Dump
operator).
• Batch Mode (Script) – You can run Apache Pig in Batch mode by
writing the Pig Latin script in a single file with .pig extension.
• Embedded Mode (UDF) – Apache Pig provides the provision of
defining our own functions (User Defined Functions) in
programming languages such as Java, and using them in our script.
• Interactive Mode:
grunt> customers= LOAD '/home/cloudera/customers.txt'
USING PigStorage(',');
grunt> dump customers;
• Batch Mode (Local):
[cloudera@quickstart ~]$ cat pig_samplescript_local.pig
customers= LOAD '/home/cloudera/customers.txt' USING
PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
dump customers;
[cloudera@quickstart ~]$ pig -x local pig_samplescript_local.pig
• Batch Mode (HDFS):
[cloudera@quickstart ~]$ cat pig_samplescript_global.pig
customers= LOAD '/training/customers.txt' USING
PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
dump customers;
[cloudera@quickstart ~]$ pig -x mapreduce pig_samplescript_global.pig
Diagnostic Operators
• The load statement simply loads the data into the
specified relation in Apache Pig. To verify the execution
of the Load statement, you have to use the Diagnostic
Operators.
• Pig Latin provides four different types of diagnostic
operators:
– Dump operator
– Describe operator
– Explain operator
– Illustrate operator
• Dump operator
• The Dump operator is used to run the Pig Latin
statements and display the results on the screen. It is
generally used for debugging purposes.
grunt> customers= LOAD '/home/cloudera/customers.txt'
USING PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
grunt> dump customers;
• Describe operator
• The describe operator is used to view the
schema of a relation/bag.
grunt> customers= LOAD '/home/cloudera/customers.txt'
USING PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
grunt> describe customers;
customers: {id: int,name: chararray,age: int,address: chararray,salary: int}
• Explain operator
• The explain operator is used to display the logical,
physical, and MapReduce execution plans of a
relation/bag.
grunt> customers= LOAD '/home/cloudera/customers.txt'
USING PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
grunt> explain customers;
• Illustrate operator
• The illustrate operator is used to display the step-by-step
execution of a sequence of statements on a small sample
of the data.
grunt> customers= LOAD '/home/cloudera/customers.txt'
USING PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
grunt> illustrate customers;
Grouping & Joining
Group Operator
• The GROUP operator is used to group the data in one
or more relations. It collects the data having the same
key.
• grunt> student_details = LOAD '/home/cloudera/students.txt'
USING PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, age:int, phone:chararray, city:chararray);
• grunt> student_groupdata = GROUP student_details by age;
• grunt> dump student_groupdata;
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})
• grunt> describe student_groupdata;
student_groupdata: {group: int,student_details: {(id: int,firstname:
chararray,lastname: chararray,age: int,phone: chararray,city:
chararray)}}
• grunt> Illustrate student_groupdata;
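The shape of the GROUP output (a key followed by a bag of the original tuples) can be mimicked in plain Python. This is only a sketch of the semantics, not how Pig executes the statement; the helper name `group_by` and the four in-memory rows are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the student_details relation:
# (id, firstname, lastname, age, phone, city).
student_details = [
    (1, "Rajiv", "Reddy", 21, "9848022337", "Hyderabad"),
    (2, "siddarth", "Battacharya", 22, "9848022338", "Kolkata"),
    (3, "Rajesh", "Khanna", 22, "9848022339", "Delhi"),
    (4, "Preethi", "Agarwal", 21, "9848022330", "Pune"),
]

def group_by(relation, key_index):
    """Collect tuples that share a key, similar to GROUP relation BY field."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    # Each result row is (group key, bag of the original tuples).
    return sorted(groups.items())

student_groupdata = group_by(student_details, 3)
for key, bag in student_groupdata:
    print(key, bag)
```

Each printed row pairs an age with the list of full student tuples for that age, which corresponds to the `group` field and the nested `student_details` bag in the describe output.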
Grouping by Multiple Columns
• grunt> student_details = LOAD
'/home/cloudera/students.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
• grunt> student_multiplegroup = GROUP
student_details by (age, city);
• grunt> dump student_multiplegroup;
((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)})
((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)})
((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)})
((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
((24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})
Join Operator
• The JOIN operator is used to combine records from
two or more relations.
• Types of Joins:
– Self-join
– Inner-join
– Outer join : left join, right join, full join
Self Join
• customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as
(id:int, name:chararray, age:int, address:chararray, salary:int);
• orders = LOAD '/home/local/orders.txt' USING PigStorage(',') as (oid:int,
date:chararray, customer_id:int, amount:int);
• grunt> customers1 = LOAD '/home/cloudera/customers.txt' USING PigStorage(',')
as (id:int,
name:chararray, age:int, address:chararray, salary:int);
• grunt> customers2 = LOAD '/home/cloudera/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
• grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
• grunt> Dump customers3;
Inner Join (equijoin)
• grunt> customers = LOAD '/home/cloudera/customers.txt'
USING PigStorage(',') as (id:int, name:chararray, age:int,
address:chararray, salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING
PigStorage(',') as (oid:int, date:chararray, customer_id:int,
amount:int);
• grunt> customer_orders = JOIN customers BY id, orders BY
customer_id;
• grunt> dump customer_orders;
Left Outer Join
• The left outer join operation returns all rows from the left relation,
even if there are no matches in the right relation.
• grunt> customers = LOAD '/home/cloudera/customers.txt'
USING PigStorage(',') as (id:int, name:chararray, age:int,
address:chararray, salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING
PigStorage(',') as (oid:int, date:chararray, customer_id:int,
amount:int);
• grunt> outer_left = JOIN customers BY id LEFT OUTER,
orders BY customer_id;
• grunt> Dump outer_left;
Right Outer Join
• The right outer join operation returns all rows from the right relation,
even if there are no matches in the left relation.
• grunt> customers = LOAD '/home/cloudera/customers.txt'
USING PigStorage(',') as (id:int, name:chararray, age:int,
address:chararray, salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING
PigStorage(',') as (oid:int, date:chararray, customer_id:int,
amount:int);
• grunt> outer_right = JOIN customers BY id RIGHT, orders BY
customer_id;
• grunt> Dump outer_right;
Full Outer Join
• The full outer join operation returns all rows from both relations,
with nulls filled in wherever the other relation has no match.
• grunt> customers = LOAD '/home/cloudera/customers.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, address:chararray,
salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING
PigStorage(',') as (oid:int, date:chararray, customer_id:int,
amount:int);
• grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY
customer_id;
• grunt> Dump outer_full;
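The difference between the inner, left, right, and full variants is easiest to see on a tiny data set. The following is a minimal Python sketch of equijoin semantics, with `None` standing in for Pig's null. The function `join`, its `how` parameter, and the sample rows are hypothetical and are not part of Pig.

```python
# Hypothetical sample rows: customers are (id, name), orders are (oid, customer_id).
customers = [(1, "Ramesh"), (2, "Khilan"), (3, "Kaushik")]
orders = [(100, 3), (101, 3), (102, 2), (103, 9)]

def join(left, right, left_key, right_key, how="inner"):
    """Equijoin two lists of tuples; unmatched sides are padded with None."""
    left_pad = (None,) * len(left[0])
    right_pad = (None,) * len(right[0])
    result = []
    matched_right = set()
    for lrow in left:
        hits = [i for i, rrow in enumerate(right)
                if rrow[right_key] == lrow[left_key]]
        matched_right.update(hits)
        for i in hits:
            result.append(lrow + right[i])
        # Left and full joins keep left rows that found no partner.
        if not hits and how in ("left", "full"):
            result.append(lrow + right_pad)
    # Right and full joins keep right rows that found no partner.
    if how in ("right", "full"):
        for i, rrow in enumerate(right):
            if i not in matched_right:
                result.append(left_pad + rrow)
    return result

inner = join(customers, orders, 0, 1, "inner")
outer_left = join(customers, orders, 0, 1, "left")
outer_right = join(customers, orders, 0, 1, "right")
outer_full = join(customers, orders, 0, 1, "full")
```

In this sample, customer 1 has no orders and order 103 refers to a customer that does not exist, so each outer variant adds exactly the padded rows that the inner join leaves out.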
Cross Operator
• grunt> customers = LOAD '/home/cloudera/customers.txt'
USING PigStorage(',') as (id:int, name:chararray, age:int,
address:chararray, salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt'
USING PigStorage(',') as (oid:int, date:chararray,
customer_id:int, amount:int);
• grunt> cross_data = CROSS customers, orders;
• grunt> Dump cross_data;
Combining & Splitting
Union Operator
• The UNION operator of Pig Latin is used to merge the
content of two relations. To perform UNION operation
on two relations, their columns and domains must be
identical.
• grunt> student1 = LOAD
'/home/cloudera/student_data1.txt' USING
PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, phone:chararray, city:chararray);
• grunt> student2 = LOAD
'/home/cloudera/student_data2.txt' USING
PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, phone:chararray, city:chararray);
• grunt> student = UNION student1, student2;
• grunt> dump student;
Split Operator
• The SPLIT operator is used to split a relation into two or
more relations.
• grunt> student_details = LOAD '/home/cloudera/student_details.txt'
USING PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, age:int, phone:chararray, city:chararray);
• Let us now split the relation into two: one listing the students whose
age is less than 23, and the other listing the students whose age is
between 23 and 25.
• SPLIT student_details into student_details1 if age<23,
student_details2 if (age>23 and age<25);
• grunt> Dump student_details1;
• grunt> Dump student_details2;
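SPLIT tests each tuple against every condition, so a tuple can land in several outputs or in none. The Python sketch below applies the two conditions exactly as written in the SPLIT statement above to a few hypothetical (name, age) rows; the helper `split` is an illustration, not a Pig API.

```python
# Hypothetical (name, age) rows standing in for student_details.
students = [("Rajiv", 21), ("Rajesh", 22), ("Archana", 23), ("Komal", 24)]

def split(relation, conditions):
    """Route each tuple to every output whose condition it satisfies."""
    outputs = {name: [] for name in conditions}
    for row in relation:
        for name, predicate in conditions.items():
            if predicate(row):
                outputs[name].append(row)
    return outputs

parts = split(students, {
    "student_details1": lambda row: row[1] < 23,        # age<23
    "student_details2": lambda row: 23 < row[1] < 25,   # age>23 and age<25
})
print(parts)
```

With these conditions the 23-year-old row appears in neither output, because `age<23` and `age>23` both exclude 23. If students aged exactly 23 should be kept, the second condition would need `age>=23`.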
Filtering
Filter Operator
• The FILTER operator is used to select the required
tuples from a relation based on a condition.
• grunt> student_details = LOAD
'/home/cloudera/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, age:int, phone:chararray,
city:chararray);
• grunt> filter_data = FILTER student_details BY city ==
'Chennai';
• grunt> dump filter_data;
Distinct Operator
• The DISTINCT operator is used to remove
redundant (duplicate) tuples from a relation.
• grunt> student_details = LOAD
'/home/cloudera/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, age:int, phone:chararray,
city:chararray);
• grunt> distinct_data = DISTINCT student_details;
• grunt> dump distinct_data;
Foreach Operator
• The FOREACH operator is used to generate
specified data transformations based on the column
data.
grunt> student_details = LOAD
'/home/cloudera/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
• Get the id, age, and city values of each student from
the relation student_details and store them in
another relation named foreach_data using the
FOREACH operator.
• grunt> foreach_data = FOREACH student_details GENERATE id, age, city;
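For readers who know Python better than Pig Latin, FILTER, DISTINCT, and FOREACH ... GENERATE correspond roughly to a filtering comprehension, de-duplication, and a projection. The rows below are hypothetical (the duplicated last row is added on purpose), and the sketch only mirrors the semantics on an in-memory list.

```python
# Hypothetical rows: (id, firstname, lastname, age, phone, city).
student_details = [
    (1, "Rajiv", "Reddy", 21, "9848022337", "Hyderabad"),
    (6, "Archana", "Mishra", 23, "9848022335", "Chennai"),
    (8, "Bharathi", "Nambiayar", 24, "9848022333", "Chennai"),
    (8, "Bharathi", "Nambiayar", 24, "9848022333", "Chennai"),
]

# FILTER student_details BY city == 'Chennai'
filter_data = [row for row in student_details if row[5] == "Chennai"]

# DISTINCT student_details (dict.fromkeys drops repeats, keeps first-seen order)
distinct_data = list(dict.fromkeys(student_details))

# FOREACH student_details GENERATE id, age, city
foreach_data = [(row[0], row[3], row[5]) for row in student_details]
```

Note that the order in which Pig returns DISTINCT results is not something this sketch models; only the set of surviving tuples is comparable.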
Sorting
Order By
• The ORDER BY operator is used to display the
contents of a relation in a sorted order based on one
or more fields.
• grunt> student_details = LOAD
'/home/cloudera/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, age:int, phone:chararray,
city:chararray);
• grunt> order_by_data = ORDER student_details BY age
DESC;
Limit Operator
• grunt> student_details = LOAD
'/home/cloudera/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, age:int, phone:chararray,
city:chararray);
• grunt> limit_data = LIMIT student_details 4;
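As a rough analogy, ORDER ... BY age DESC behaves like a descending sort and LIMIT n like taking the first n tuples. The sketch below chains the two on hypothetical (name, age) rows. In the statements above LIMIT is applied to the unsorted student_details, so which four tuples Pig returns there is not determined by any ordering.

```python
# Hypothetical (name, age) rows.
students = [("Rajiv", 21), ("Komal", 24), ("Rajesh", 22),
            ("Archana", 23), ("Trupthi", 23)]

# ORDER students BY age DESC
order_by_data = sorted(students, key=lambda row: row[1], reverse=True)

# LIMIT order_by_data 4
limit_data = order_by_data[:4]
print(limit_data)
```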
Pig Latin Built-In Functions
• Eval Functions
• String Functions
• Date-time Functions
• Math Functions
Eval Functions
• AVG()
• CONCAT()
• COUNT()
• COUNT_STAR()
• DIFF()
• MAX()
• MIN()
• SIZE()
• SUBTRACT()
• SUM()
AVG()
• Computes the average of the numeric values in a single-column bag.
• grunt> A = LOAD '/home/cloudera/student.txt' USING PigStorage(',') as
(name:chararray, term:chararray, gpa:float);
• grunt> DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
• grunt> B = GROUP A BY name;
• grunt> DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
• grunt> C = FOREACH B GENERATE A.name, AVG(A.gpa);
• grunt> DUMP C;
({(John),(John),(John),(John)},3.850000023841858)
({(Mary),(Mary),(Mary),(Mary)},3.925000011920929)
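The long decimals in the AVG output (3.850000023841858 rather than 3.85) are consistent with the gpa values being stored as 32-bit floats and averaged in double precision. The Python sketch below reproduces that effect under exactly that assumption; `to_float32` is a hypothetical helper built on the standard `struct` module.

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# GPA values per student, as listed in the DUMP A output above.
gpas = {"John": [3.9, 3.7, 4.0, 3.8], "Mary": [3.8, 3.9, 4.0, 4.0]}

# Average the float32-rounded values using double-precision arithmetic.
averages = {
    name: sum(to_float32(g) for g in values) / len(values)
    for name, values in gpas.items()
}
print(averages)
```

The small excess over 3.85 and 3.925 comes from values such as 3.9 not being exactly representable as a 32-bit float.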
CONCAT()
• Concatenates two expressions of identical type.
• grunt> A = LOAD '/home/Cloudera/data.txt' as
(f1:chararray, f2:chararray, f3:chararray);
• grunt> DUMP A;
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)
• grunt> X = FOREACH A GENERATE CONCAT(f2,f3);
• grunt> DUMP X;
(opensource)
(mapreduce)
(piglatin)
COUNT
• Computes the number of elements in a bag.
• Note: You cannot use the tuple designator (*) with
COUNT; that is, COUNT(*) will not work.
• grunt> A = LOAD '/home/cloudera/c.txt' USING PigStorage(',') as (f1:int, f2:int, f3:int);
• grunt> DUMP A;
1,2,3
4,2,null
8,3,4
4,3,null
7,5,null
8,4,3
• grunt>B = GROUP A BY f1;
• grunt>DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
• grunt>X = FOREACH B GENERATE COUNT(A);
• grunt>DUMP X;
(1L)
(2L)
(1L)
(2L)
COUNT_STAR
• Computes the number of elements in a bag.
• COUNT_STAR includes NULL values in the count
computation (unlike COUNT, which ignores NULL
values).
• Example
• In this example COUNT_STAR is used to count the
tuples in a bag.
• grunt>X = FOREACH B GENERATE COUNT_STAR(A);
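The difference between COUNT and COUNT_STAR only shows up when a bag contains nulls. The Python sketch below models a bag as a list of tuples with `None` for null; it relies on the documented behavior that COUNT skips a tuple whose first field is null. The sample bag and the function names are hypothetical.

```python
# A bag is modelled as a list of tuples; None plays the role of Pig's null.
bag = [(4, 2, None), (None, 3, 3), (4, None, 1)]

def count(bag):
    """Like COUNT: skip tuples whose first field is null."""
    return sum(1 for t in bag if t[0] is not None)

def count_star(bag):
    """Like COUNT_STAR: count every tuple, nulls included."""
    return len(bag)

print(count(bag), count_star(bag))
```

Only the tuple with a null in the first position is skipped by `count`; nulls in later fields do not affect either result.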
DIFF
• Compares two fields in a tuple.
• grunt> A = LOAD '/home/Cloudera/data.txt' AS
(B1:bag{T1:tuple(t1:int,t2:int)},B2:bag{T2:tuple(f1:int,f2:int)});
• grunt> DUMP A;
({(8,9),(0,1)},{(8,9),(1,1)})
({(2,3),(4,5)},{(2,3),(4,5)})
({(6,7),(3,7)},{(2,2),(3,7)})
• grunt> DESCRIBE A;
A: {B1: {T1: (t1: int,t2: int)},B2: {T2: (f1: int,f2: int)}}
• grunt> X = FOREACH A GENERATE DIFF(B1,B2);
• grunt> DUMP X;
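DIFF returns the tuples that are in one bag but not the other, and an empty bag when the two bags match. A minimal Python sketch of that behavior on the three rows shown above follows; the function `diff` is a hypothetical stand-in, and the order of tuples inside a real Pig bag is not guaranteed.

```python
def diff(bag1, bag2):
    """Return the tuples that appear in exactly one of the two bags."""
    only_in_1 = [t for t in bag1 if t not in bag2]
    only_in_2 = [t for t in bag2 if t not in bag1]
    return only_in_1 + only_in_2

# The (B1, B2) pairs from the DUMP A output above.
rows = [
    ([(8, 9), (0, 1)], [(8, 9), (1, 1)]),
    ([(2, 3), (4, 5)], [(2, 3), (4, 5)]),
    ([(6, 7), (3, 7)], [(2, 2), (3, 7)]),
]

X = [diff(b1, b2) for b1, b2 in rows]
for bag in X:
    print(bag)
```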
MAX
• Computes the maximum of the numeric values or
chararrays in a single-column bag. MAX requires a
preceding GROUP ALL statement for global maximums
and a GROUP BY statement for group maximums.
• Example
– In this example the maximum GPA for all terms is computed
for each student (see the GROUP operator for information
about the field names in relation B).
• grunt> A = LOAD 'home/Cloudera/student.txt' AS (name:chararray, session:chararray,
gpa:float);
• grunt> DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
• grunt> B = GROUP A BY name;
• grunt> DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
• grunt> X = FOREACH B GENERATE group, MAX(A.gpa);
• grunt> DUMP X;
MIN
• Computes the minimum of the numeric values or
chararrays in a single-column bag. MIN requires a
preceding GROUP ALL statement for global
minimums and a GROUP BY statement for group
minimums.
• Example
– In this example the minimum GPA for all terms is computed for
each student (see the GROUP operator for information about
the field names in relation B).
• grunt> A = LOAD '/home/Cloudera/student.txt' AS (name:chararray, session:chararray,
gpa:float);
• grunt> DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
• grunt> B = GROUP A BY name;
• grunt> DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
• grunt> X = FOREACH B GENERATE group, MIN(A.gpa);
• grunt> DUMP X;
(John,3.7F)
(Mary,3.8F)
SIZE
• Computes the number of elements based on any Pig data type.
• Example
• In this example the number of characters in the first field is
computed.
• grunt> A = LOAD 'data' as (f1:chararray, f2:chararray,
f3:chararray);
• grunt> DUMP A;
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)
• grunt> X = FOREACH A GENERATE SIZE(f1);
• grunt> DUMP X;
(6L)
(6L)
(3L)
SUM
• Computes the sum of the numeric values in a
single-column bag. SUM requires a preceding
GROUP ALL statement for global sums and a
GROUP BY statement for group sums.
• Example
• In this example the number of pets is computed.
• grunt> A = LOAD '/home/Cloudera/data' AS (owner:chararray,
pet_type:chararray, pet_num:int);
• grunt> DUMP A;
(Alice,turtle,1)
(Alice,goldfish,5)
(Alice,cat,2)
(Bob,dog,2)
(Bob,cat,2)
• grunt> B = GROUP A BY owner;
• grunt> DUMP B;
(Alice,{(Alice,turtle,1),(Alice,goldfish,5),(Alice,cat,2)})
(Bob,{(Bob,dog,2),(Bob,cat,2)})
• grunt> X = FOREACH B GENERATE group, SUM(A.pet_num);
• grunt> DUMP X;
(Alice,8L)
(Bob,4L)
String Functions
• ENDSWITH
• STARTSWITH
• SUBSTRING
• EqualsIgnoreCase
• UPPER
• LOWER
• REPLACE
• TRIM, RTRIM, LTRIM
ENDSWITH, STARTSWITH
• ENDSWITH - This function accepts two string
parameters. It verifies whether the first string
ends with the second.
• STARTSWITH - This function accepts two string
parameters. It verifies whether the first string starts
with the second.
• emp.txt
• grunt> emp_data = LOAD '/home/cloudera/emp.txt'
USING PigStorage(',') as (id:int, name:chararray, age:int,
city:chararray);
• grunt> emp_endswith = FOREACH emp_data
GENERATE (id,name), ENDSWITH(name, 'n');
• grunt> Dump emp_endswith;
• grunt> startswith_data = FOREACH emp_data GENERATE
(id,name), STARTSWITH(name, 'Ro');
• grunt> Dump startswith_data;
SUBSTRING()
• This function returns a substring from the given
string.
• emp.txt
001,Robin,22,newyork
002,Stacy,25,Bhuwaneshwar
003,Kelly,22,Chennai
• grunt> emp_data = LOAD '/home/Cloudera/emp.txt'
USING PigStorage(',') as (id:int, name:chararray, age:int,
city:chararray);
• grunt> substring_data = FOREACH emp_data GENERATE
(id,name), SUBSTRING (name, 0, 3);
• grunt> Dump substring_data;
((1,Robin),Rob)
((2,Stacy),Sta)
((3,Kelly),Kel)
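SUBSTRING takes a start index and a stop index, and the stop index refers to the character after the last one returned, which is the same convention as Python slicing. The sketch below uses a hypothetical `substring` helper to show how the stop index changes the result for the names in emp.txt.

```python
def substring(s, start, stop):
    """Mirror SUBSTRING(string, startIndex, stopIndex); stop is exclusive."""
    return s[start:stop]

names = ["Robin", "Stacy", "Kelly"]

first_two = [substring(n, 0, 2) for n in names]
first_three = [substring(n, 0, 3) for n in names]
print(first_two, first_three)
```

A stop index of 3 is what yields three-character prefixes such as Rob, Sta, and Kel; a stop index of 2 yields only two characters.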
EqualsIgnoreCase()
• The EqualsIgnoreCase() function is used to compare
two strings and verify whether they are equal. If both are
equal this function returns the Boolean value true else it
returns the value false.
• grunt> emp_data = LOAD '/home/Cloudera/emp.txt' USING PigStorage(',') as
(id:int, name:chararray, age:int, city:chararray);
• grunt> equals_data = FOREACH emp_data GENERATE (id,name),
EqualsIgnoreCase(name, 'Robin');
• grunt> Dump equals_data;
((1,Robin),true)
((2,BOB),false)
((3,Maya),false)
((4,Sara),false)
((5,David),false)
((6,Maggy),false)
((7,Robert),false)
((8,Syam),false)
((9,Mary),false)
((10,Saran),false)
UPPER(), LOWER()
• UPPER- This function is used to convert all the
characters in a string to uppercase.
• LOWER- This function is used to convert all the
characters in a string to lowercase.
• grunt> emp_data = LOAD '/home/cloudera/emp.txt'
USING PigStorage(',') as (id:int, name:chararray, age:int,
city:chararray);
• grunt> upper_data = FOREACH emp_data GENERATE
(id,name), UPPER(name);
• grunt> Dump upper_data;
• grunt> lower_data = FOREACH emp_data GENERATE
(id,name), LOWER(name);
• grunt> Dump lower_data;
REPLACE()
• This function is used to replace every part of a given string
that matches a regular expression with the new characters.
• grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, city:chararray);
• grunt> replace_data = FOREACH emp_data GENERATE
(id,city),REPLACE(city,'Bhuwaneshwar','Bhuw');
• grunt> Dump replace_data;
((1,newyork),newyork)
((2,Kolkata),Kolkata)
((3,Tokyo),Tokyo)
((4,London),London)
((5,Bhuwaneshwar),Bhuw)
((6,Chennai),Chennai)
((7,newyork),newyork)
((8,Kolkata),Kolkata)
((9,Tokyo),Tokyo)
((10,London),London)
((11,Bhuwaneshwar),Bhuw)
((12,Chennai),Chennai)
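Because the second argument of REPLACE is treated as a regular expression, it can match patterns as well as literal text. The Python sketch below uses `re.sub` as an approximate analogue. Pig relies on Java regular expressions, which differ from Python's in some details, and the helper name `replace` is hypothetical.

```python
import re

def replace(s, regex, new):
    """Replace every match of regex in s, similar to REPLACE(string, 'regExp', 'newChar')."""
    return re.sub(regex, new, s)

cities = ["newyork", "Kolkata", "Bhuwaneshwar"]

# A literal pattern only changes the strings that contain it.
replaced = [replace(c, "Bhuwaneshwar", "Bhuw") for c in cities]

# A character class replaces every matching character.
masked = replace("Chennai", "[aeiou]", "*")
print(replaced, masked)
```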
TRIM(), RTRIM(), LTRIM()
• The TRIM() function accepts a string and returns a copy of it with
the leading and trailing spaces removed.
• The LTRIM() function works like TRIM(), but removes spaces only
from the left side of the given string (leading spaces).
• The RTRIM() function works like TRIM(), but removes spaces only
from the right side of the given string (trailing spaces).
• grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
• grunt> trim_data = FOREACH emp_data GENERATE (id,name),
TRIM(name);
• grunt> ltrim_data = FOREACH emp_data GENERATE (id,name),
LTRIM(name);
• grunt> rtrim_data = FOREACH emp_data GENERATE (id,name),
RTRIM(name);
• grunt> Dump trim_data;
• grunt> Dump ltrim_data;
Date-time Functions
• ToDate()
• GetDay()
• GetMonth()
• GetYear()
ToDate()
• This function is used to generate a DateTime object
according to the given parameters.
• date.txt
001,1989/09/26 09:00:00
002,1980/06/20 10:22:00
003,1990/12/19 03:11:44
• grunt> date_data = LOAD '/home/cloudera/date.txt'
USING PigStorage(',') as (id:int,date:chararray);
• grunt> todate_data = foreach date_data generate
ToDate(date,'yyyy/MM/dd HH:mm:ss') as
(date_time:DateTime);
• grunt> Dump todate_data;
(1989-09-26T09:00:00.000+05:30)
(1980-06-20T10:22:00.000+05:30)
GetDay()
• This function accepts a date-time object as a
parameter and returns the day of the month of the given
date-time object.
• date.txt
001,1989/09/26 09:00:00
002,1980/06/20 10:22:00
003,1990/12/19 03:11:44
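The two steps (parse a string into a date-time value, then extract the day) can be mirrored in Python with `datetime.strptime`. This is only an analogy: Pig's ToDate uses the pattern 'yyyy/MM/dd HH:mm:ss', whereas Python uses `%`-style directives, and time zones are ignored in this sketch.

```python
from datetime import datetime

# The date strings from date.txt above.
raw = ["1989/09/26 09:00:00", "1980/06/20 10:22:00", "1990/12/19 03:11:44"]

# Roughly ToDate(date, 'yyyy/MM/dd HH:mm:ss')
parsed = [datetime.strptime(s, "%Y/%m/%d %H:%M:%S") for s in raw]

# Roughly GetDay(date_time): the day of the month
days = [d.day for d in parsed]
print(days)
```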
UDFs
User Defined Functions
• Apache Pig provides extensive support for User Defined
Functions (UDFs).
• Using these UDFs, we can define our own functions
and use them.
• UDF support is provided in six programming
languages: Java, Jython, Python, JavaScript, Ruby,
and Groovy.
Creating UDFs
• Open Eclipse and create a new project.
• Convert the newly created project into a Maven project.
• Copy the pom.xml. This file contains the Maven
dependencies for Apache Pig and Hadoop-core jar files.
Java code
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Sample_Eval extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        // Return null when there is nothing to process.
        if (input == null || input.size() == 0)
            return null;
        // Assumed completion (the slide ends at the check above):
        // upper-case the first field, matching the Upper_case usage.
        String str = (String) input.get(0);
        return str.toUpperCase();
    }
}
Registering the Jar file
• grunt> REGISTER '/home/cloudera/sample_udf.jar';
• grunt> DEFINE sample_eval Sample_Eval();
• grunt> emp_data = LOAD '/home/cloudera/pigdata.txt'
USING PigStorage(',') as (id:int, name:chararray,
age:int, city:chararray);
• grunt> Upper_case = FOREACH emp_data
GENERATE sample_eval(name);

From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 

Recently uploaded (20)

Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 

pig intro.pdf

  • 1.
  • 2. What is Pig? • Apache Pig is an abstraction over MapReduce. • It is a tool/platform used to analyze large data sets by representing them as data flows. • Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig. • To write data analysis programs, Pig provides a high-level language known as Pig Latin. • This language provides various operators, and programmers can also develop their own functions for reading, writing, and processing data.
  • 4. • To analyze data using Apache Pig, programmers need to write scripts using Pig Latin language. • All these scripts are internally converted to Map and Reduce tasks. • Apache Pig has a component known as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
  • 5. Features of Pig • Rich set of operators: It provides many operators to perform operations like join, sort, filter, etc. • Ease of programming: Pig Latin is similar to SQL, so it is easy to write a Pig script if you are good at SQL. • UDFs: Pig provides the facility to create user-defined functions in other programming languages such as Java and invoke or embed them in Pig scripts. • Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured and unstructured. It stores the results in HDFS.
  • 6. Apache Pig Vs Hive • Both Apache Pig and Hive are used to create MapReduce jobs. In some cases, Hive operates on HDFS in a way similar to Apache Pig.
  • 7. Pig Latin – Data Model
  • 8.
  • 9. Pig Execution Modes • You can run Apache Pig in two modes. • Local Mode – In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes. • MapReduce Mode – MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in the HDFS.
  • 10. Invoking the Grunt Shell • Local Mode • $ pig -x local • MapReduce mode • $ pig -x mapreduce (or) $ pig
  • 11. Execution Mechanisms • Interactive Mode (Grunt shell) – You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using Dump operator). • Batch Mode (Script) – You can run Apache Pig in Batch mode by writing the Pig Latin script in a single file with .pig extension. • Embedded Mode (UDF) – Apache Pig provides the provision of defining our own functions (User Defined Functions) in programming languages such as Java, and using them in our script.
  • 12. • Interactive Mode: grunt> customers= LOAD '/home/cloudera/customers.txt' USING PigStorage(','); grunt> dump customers; • Batch Mode (Local): [cloudera@quickstart ~]$ cat pig_samplescript_local.pig customers= LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,address:chararray,salary:int); dump customers; [cloudera@quickstart ~]$ pig -x local pig_samplescript_local.pig
  • 13. • Batch Mode (HDFS): [cloudera@quickstart ~]$ cat pig_samplescript_global.pig customers= LOAD '/training/customers.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,address:chararray,salary:int); dump customers; [cloudera@quickstart ~]$ pig -x mapreduce pig_samplescript_global.pig
  • 14. Diagnostic Operators • The load statement simply loads the data into the specified relation in Apache Pig. To verify the execution of the Load statement, you have to use the diagnostic operators. • Pig Latin provides four diagnostic operators: – Dump operator – Describe operator – Explain operator – Illustrate operator
  • 15. • Dump operator • The Dump operator is used to run the Pig Latin statements and display the results on the screen. It is generally used for debugging purposes. grunt> customers= LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,address:chararray,salary:int); grunt> dump customers;
  • 16. • Describe operator • The describe operator is used to view the schema of a relation/bag. grunt> customers= LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,address:chararray,salary:int); grunt> describe customers; customers: {id: int,name: chararray,age: int,address: chararray,salary: int}
  • 17. • Explain operator • The explain operator is used to display the logical, physical, and MapReduce execution plans of a relation/bag. grunt> customers= LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,address:chararray,salary:int); grunt> explain customers;
  • 18. • Illustrate operator • The illustrate operator shows the step-by-step execution of a sequence of statements on a small sample of the data. grunt> customers= LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,address:chararray,salary:int); grunt> illustrate customers;
  • 20. Group Operator • The GROUP operator is used to group the data in one or more relations. It collects the data having the same key. • grunt> student_details = LOAD '/home/cloudera/students.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray); • grunt> student_groupdata = GROUP student_details by age;
  • 21. • grunt> dump student_groupdata; (21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)}) (22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)}) (23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)}) (24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)}) • grunt> describe student_groupdata; student_groupdata: {group: int,student_details: {(id: int,firstname: chararray,lastname: chararray,age: int,phone: chararray,city: chararray)}} • grunt> illustrate student_groupdata;
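The grouping result above can be modelled outside Pig. The following is a minimal plain-Python sketch of what GROUP ... BY produces, one (group, bag) pair per key; it is an illustration of the semantics only, not how Pig executes the statement, and the helper name `group_by` is hypothetical.

```python
from collections import defaultdict

# A few rows taken from the dump above: (id, firstname, lastname, age, phone, city)
student_details = [
    (1, "Rajiv", "Reddy", 21, "9848022337", "Hyderabad"),
    (2, "siddarth", "Battacharya", 22, "9848022338", "Kolkata"),
    (3, "Rajesh", "Khanna", 22, "9848022339", "Delhi"),
    (4, "Preethi", "Agarwal", 21, "9848022330", "Pune"),
]

def group_by(rows, key):
    """Model of GROUP rows BY key: one (group, bag) pair per distinct key."""
    bags = defaultdict(list)
    for row in rows:
        bags[key(row)].append(row)
    # Sort by group key only, so the printed order is stable.
    return sorted(bags.items(), key=lambda pair: pair[0])

student_groupdata = group_by(student_details, key=lambda r: r[3])
for group, bag in student_groupdata:
    print(group, [row[1] for row in bag])
```

Passing a key such as `lambda r: (r[3], r[5])` models grouping by several columns, where the group field becomes a tuple.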
  • 22. Grouping by Multiple Columns • grunt> student_details = LOAD '/home/cloudera/students.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray); • grunt> student_multiplegroup = GROUP student_details by (age, city);
  • 23. • grunt> dump student_multiplegroup; ((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)}) ((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)}) ((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)}) ((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)}) ((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)}) ((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)}) ((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)}) ((24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})
  • 24. Join Operator • The JOIN operator is used to combine records from two or more relations. • Types of Joins: – Self-join – Inner-join – Outer join : left join, right join, full join
  • 25. Self Join • customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int); • orders = LOAD '/home/local/orders.txt' USING PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int); • grunt> customers1 = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int); • grunt> customers2 = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int); • grunt> customers3 = JOIN customers1 BY id, customers2 BY id; • grunt> Dump customers3;
  • 26. Inner Join (equijoin) • grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int); • grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int); • grunt> customer_orders = JOIN customers BY id, orders BY customer_id; • grunt> dump customer_orders;
  • 27. Left Outer Join • The left outer join operation returns all rows from the left relation, even if there are no matches in the right relation. • grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int); • grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int); • grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id; • grunt> Dump outer_left;
  • 28. Right Outer Join • The right outer join operation returns all rows from the right relation, even if there are no matches in the left relation. • grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int); • grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int); • grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id; • grunt> Dump outer_right;
  • 29. Full Outer Join • The full outer join operation returns all rows from both relations, with nulls filling the fields of a row that has no match in the other relation. • grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int); • grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int); • grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id; • grunt> Dump outer_full;
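To see how the inner, left, right, and full joins differ in which rows survive, here is a small plain-Python model. It is a sketch of the semantics only, not Pig's implementation; the sample rows and the `join` helper are hypothetical, since the slides do not show the contents of customers.txt or orders.txt.

```python
# Hypothetical sample rows; the slides do not show the file contents.
customers = [(1, "Ramesh"), (2, "Khilan"), (3, "Kaushik")]   # (id, name)
orders = [(102, 3, 3000), (100, 3, 1500),
          (101, 2, 1560), (104, 9, 2060)]                    # (oid, customer_id, amount)

def join(left, right, left_key, right_key, how="inner"):
    """Model of JOIN left BY k1 [LEFT|RIGHT|FULL OUTER], right BY k2."""
    null_left = (None,) * len(left[0])
    null_right = (None,) * len(right[0])
    result, matched = [], set()
    for l_row in left:
        # Null keys never match anything, as in a relational join.
        hits = [i for i, r_row in enumerate(right)
                if l_row[left_key] is not None
                and r_row[right_key] == l_row[left_key]]
        matched.update(hits)
        result.extend(l_row + right[i] for i in hits)
        if not hits and how in ("left", "full"):
            result.append(l_row + null_right)    # keep unmatched left row
    if how in ("right", "full"):
        # Keep right rows that no left row matched.
        result.extend(null_left + r_row
                      for i, r_row in enumerate(right) if i not in matched)
    return result

for how in ("inner", "left", "right", "full"):
    print(how, len(join(customers, orders, 0, 1, how)))
```

With this data, customer 1 has no orders and order 104 has no customer, so each outer variant adds exactly the unmatched rows from the side it preserves.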
  • 30. Cross Operator • grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int); • grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int); • grunt> cross_data = CROSS customers, orders; • grunt> Dump cross_data;
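CROSS pairs every tuple of the first relation with every tuple of the second, so the output size is the product of the two input sizes. A minimal plain-Python sketch of that behaviour, with hypothetical sample rows:

```python
from itertools import product

# Hypothetical sample rows, for illustration only.
customers = [(1, "Ramesh"), (2, "Khilan")]          # (id, name)
orders = [(100, 1500), (101, 1560), (102, 3000)]    # (oid, amount)

# Model of: cross_data = CROSS customers, orders;
cross_data = [c + o for c, o in product(customers, orders)]

print(len(cross_data))
```

Because the output grows multiplicatively, CROSS on large relations is expensive and is usually applied after filtering.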
  • 32. Union Operator • The UNION operator of Pig Latin is used to merge the content of two relations. To perform UNION operation on two relations, their columns and domains must be identical.
  • 33. • grunt> student1 = LOAD '/home/cloudera/student_data1.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray); • grunt> student2 = LOAD '/home/cloudera/student_data2.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray); • grunt> student = UNION student1, student2; • grunt> dump student;
  • 34. Split Operator • The SPLIT operator is used to split a relation into two or more relations.
  • 35. • grunt> student_details = LOAD '/home/cloudera/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray); • Let us now split the relation into two, one listing the students younger than 23, and the other listing the students aged between 23 and 25. • grunt> SPLIT student_details into student_details1 if age<23, student_details2 if (age>=23 and age<=25); • grunt> Dump student_details1; • grunt> Dump student_details2;
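Each SPLIT condition is evaluated independently for every tuple, so a tuple can land in one output, in several, or in none. The plain-Python sketch below models this, assuming "between 23 and 25" is meant inclusively; the `split` helper and the list of ages are hypothetical.

```python
# Hypothetical ages standing in for the age column of student_details.
ages = [21, 22, 23, 24, 25, 26]

def split(rows, **conditions):
    """Model of SPLIT: test every row against every named condition."""
    return {name: [row for row in rows if cond(row)]
            for name, cond in conditions.items()}

parts = split(
    ages,
    student_details1=lambda age: age < 23,
    student_details2=lambda age: 23 <= age <= 25,   # inclusive bounds (assumption)
)

print(parts["student_details1"])
print(parts["student_details2"])
```

Age 26 satisfies neither condition and is dropped from both outputs, which is worth checking whenever the conditions are not exhaustive.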
  • 37. Filter Operator • The FILTER operator is used to select the required tuples from a relation based on a condition. • grunt> student_details = LOAD '/home/cloudera/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray); • grunt> filter_data = FILTER student_details BY city == 'Chennai'; • grunt> dump filter_data;
  • 38. Distinct Operator • The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation. • grunt> student_details = LOAD '/home/cloudera/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray); • grunt> distinct_data = DISTINCT student_details; • grunt> dump distinct_data;
  • 39. Foreach Operator • The FOREACH operator is used to generate specified data transformations based on the column data. • grunt> student_details = LOAD '/home/cloudera/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray); • Get the id, age, and city values of each student from the relation student_details and store them in another relation named foreach_data using the FOREACH operator: • grunt> foreach_data = FOREACH student_details GENERATE id, age, city;
  • 41. Order By • The ORDER BY operator is used to display the contents of a relation in a sorted order based on one or more fields. • grunt> student_details = LOAD '/home/cloudera/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray); • grunt> order_by_data = ORDER student_details BY age DESC;
  • 42. Limit Operator • grunt> student_details = LOAD '/home/cloudera/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray); • grunt> limit_data = LIMIT student_details 4;
  • 43. Pig Latin Built-In Functions • Eval Functions • String Functions • Date-time Functions • Math Functions
  • 44. Eval Functions • AVG() • CONCAT() • COUNT() • COUNT_STAR() • DIFF() • MAX() • MIN() • SIZE() • SUBTRACT() • SUM()
  • 45. AVG () • Computes the average of the numeric values in a single-column bag. • grunt> A = LOAD '/home/cloudera/student.txt' USING PigStorage(',') as (name:chararray, term:chararray, gpa:float); • grunt> DUMP A; (John,fl,3.9F) (John,wt,3.7F) (John,sp,4.0F) (John,sm,3.8F) (Mary,fl,3.8F) (Mary,wt,3.9F) (Mary,sp,4.0F) (Mary,sm,4.0F) • grunt> B = GROUP A BY name; • grunt> DUMP B; (John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)}) (Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)}) • grunt> C = FOREACH B GENERATE A.name, AVG(A.gpa); • grunt> DUMP C; ({(John),(John),(John),(John)},3.850000023841858) ({(Mary),(Mary),(Mary),(Mary)},3.925000011920929)
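The long decimals in the AVG output above (3.850000023841858 rather than 3.85) are consistent with gpa being held as a 32-bit float and averaged in double precision. The plain-Python sketch below reproduces that effect under this assumption; it models the arithmetic only, and the helper `as_float32` is hypothetical.

```python
import struct
from collections import defaultdict

def as_float32(x):
    """Round a Python float to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Rows from the dump above: (name, term, gpa)
A = [
    ("John", "fl", 3.9), ("John", "wt", 3.7), ("John", "sp", 4.0), ("John", "sm", 3.8),
    ("Mary", "fl", 3.8), ("Mary", "wt", 3.9), ("Mary", "sp", 4.0), ("Mary", "sm", 4.0),
]

# Model of: B = GROUP A BY name; then AVG(A.gpa) per group.
gpas = defaultdict(list)
for name, term, gpa in A:
    gpas[name].append(as_float32(gpa))

C = {name: sum(values) / len(values) for name, values in gpas.items()}
print(C)
```

Declaring the field as double in the LOAD schema would be one way to avoid the visible rounding noise, at the cost of wider values.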
  • 46. CONCAT() • Concatenates two expressions of identical type. • grunt>A = LOAD '/home/Cloudera/data.txt' as (f1:chararray, f2:chararray, f3:chararray); • grunt>DUMP A; (apache,open,source) (hadoop,map,reduce) (pig,pig,latin) • grunt>X = FOREACH A GENERATE CONCAT(f2,f3); • grunt>DUMP X; (opensource) (mapreduce) (piglatin)
  • 47. COUNT • Computes the number of elements in a bag. • Note: You cannot use the tuple designator (*) with COUNT; that is, COUNT(*) will not work.
  • 48. • grunt>A = LOAD '/home/cloudera/c.txt' USING PigStorage(',') as (f1:int, f2:int, f3:int); • grunt>DUMP A; 1,2,3 4,2,null 8,3,4 4,3,null 7,5,null 8,4,3 • grunt>B = GROUP A BY f1; • grunt>DUMP B; (1,{(1,2,3)}) (4,{(4,2,),(4,3,)}) (7,{(7,5,)}) (8,{(8,3,4),(8,4,3)}) • grunt>X = FOREACH B GENERATE COUNT(A); • grunt>DUMP X; (1L) (2L) (1L) (2L)
  • 49. COUNT_STAR • Computes the number of elements in a bag. • COUNT_STAR includes NULL values in the count computation (unlike COUNT, which ignores NULL values). • Example • In this example COUNT_STAR is used to count the tuples in a bag. • grunt>X = FOREACH B GENERATE COUNT_STAR(A);
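The difference between COUNT and COUNT_STAR only shows up when nulls are present. In Pig, COUNT skips a tuple whose first field is null, while COUNT_STAR counts every tuple. A plain-Python sketch of that rule, using a hypothetical bag:

```python
# Hypothetical bag; None stands in for a Pig null.
bag = [(4, 2, 1), (None, 3, 3), (4, None, 5)]

def count(bag):
    """Model of COUNT: skip tuples whose first field is null."""
    return sum(1 for t in bag if t[0] is not None)

def count_star(bag):
    """Model of COUNT_STAR: count every tuple, nulls included."""
    return len(bag)

print(count(bag), count_star(bag))
```

Note that the tuple (4, None, 5) is still counted by COUNT, because only the first field is inspected.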
  • 50. DIFF • Compares two bag fields in a tuple and returns the tuples that appear in one bag but not in the other. • grunt> A = LOAD '/home/Cloudera/data.txt' AS (B1:bag{T1:tuple(t1:int,t2:int)},B2:bag{T2:tuple(f1:int,f2:int)}); • grunt> DUMP A; ({(8,9),(0,1)},{(8,9),(1,1)}) ({(2,3),(4,5)},{(2,3),(4,5)}) ({(6,7),(3,7)},{(2,2),(3,7)}) • grunt> DESCRIBE A; A: {B1: {T1: (t1: int,t2: int)},B2: {T2: (f1: int,f2: int)}} • grunt> X = FOREACH A GENERATE DIFF(B1,B2); • grunt> dump X;
  • 51. MAX • Computes the maximum of the numeric values or chararrays in a single-column bag. MAX requires a preceding GROUP ALL statement for global maximums and a GROUP BY statement for group maximums. • Example – In this example the maximum GPA for all terms is computed for each student (see the GROUP operator for information about the field names in relation B).
  • 52. • grunt> A = LOAD 'home/Cloudera/student.txt' AS (name:chararray, session:chararray, gpa:float); • grunt> DUMP A; (John,fl,3.9F) (John,wt,3.7F) (John,sp,4.0F) (John,sm,3.8F) (Mary,fl,3.8F) (Mary,wt,3.9F) (Mary,sp,4.0F) (Mary,sm,4.0F) • grunt> B = GROUP A BY name; • grunt> DUMP B; (John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)}) (Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)}) • grunt> X = FOREACH B GENERATE group, MAX(A.gpa); • grunt> DUMP X; (John,4.0F) (Mary,4.0F)
  • 53. MIN • Computes the minimum of the numeric values or chararrays in a single-column bag. MIN requires a preceding GROUP ALL statement for global minimums and a GROUP BY statement for group minimums. • Example – In this example the minimum GPA for all terms is computed for each student (see the GROUP operator for information about the field names in relation B).
  • 54. • grunt> A = LOAD '/home/Cloudera/student.txt' AS (name:chararray, session:chararray, gpa:float); • grunt> DUMP A; (John,fl,3.9F) (John,wt,3.7F) (John,sp,4.0F) (John,sm,3.8F) (Mary,fl,3.8F) (Mary,wt,3.9F) (Mary,sp,4.0F) (Mary,sm,4.0F) • grunt> B = GROUP A BY name; • grunt> DUMP B; (John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)}) (Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)}) • grunt> X = FOREACH B GENERATE group, MIN(A.gpa); • grunt> DUMP X; (John,3.7F) (Mary,3.8F)
  • 55. SIZE • Computes the number of elements based on any Pig data type. • Example • In this example the number of characters in the first field is computed. • grunt> A = LOAD 'data' as (f1:chararray, f2:chararray, f3:chararray); (apache,open,source) (hadoop,map,reduce) (pig,pig,latin) • grunt> X = FOREACH A GENERATE SIZE(f1); • grunt> DUMP X; (6L) (6L) (3L)
  • 56. SUM • Computes the sum of the numeric values in a single-column bag. SUM requires a preceding GROUP ALL statement for global sums and a GROUP BY statement for group sums. • Example • In this example the number of pets is computed.
  • 57. • grunt> A = LOAD '/home/Cloudera/data' AS (owner:chararray, pet_type:chararray, pet_num:int); • grunt> DUMP A; (Alice,turtle,1) (Alice,goldfish,5) (Alice,cat,2) (Bob,dog,2) (Bob,cat,2) • grunt> B = GROUP A BY owner; • grunt> DUMP B; (Alice,{(Alice,turtle,1),(Alice,goldfish,5),(Alice,cat,2)}) (Bob,{(Bob,dog,2),(Bob,cat,2)}) • grunt> X = FOREACH B GENERATE group, SUM(A.pet_num); • grunt> DUMP X; (Alice,8L) (Bob,4L)
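DIFF works on two bags within the same tuple and keeps whatever is in one bag but not the other. The plain-Python sketch below models this on the three rows shown above; it illustrates the semantics only, and the helper name `diff` is hypothetical.

```python
# The three (B1, B2) pairs from the dump above.
A = [
    ([(8, 9), (0, 1)], [(8, 9), (1, 1)]),
    ([(2, 3), (4, 5)], [(2, 3), (4, 5)]),
    ([(6, 7), (3, 7)], [(2, 2), (3, 7)]),
]

def diff(b1, b2):
    """Model of DIFF(B1, B2): tuples found in exactly one of the two bags."""
    only_b1 = [t for t in b1 if t not in b2]
    only_b2 = [t for t in b2 if t not in b1]
    return only_b1 + only_b2

X = [diff(b1, b2) for b1, b2 in A]
for bag in X:
    print(bag)
```

The second row produces an empty bag because its two bags contain the same tuples.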
  • 58. String Functions • ENDSWITH • STARTSWITH • SUBSTRING • EqualsIgnoreCase • UPPER • LOWER • REPLACE • TRIM, RTRIM, LTRIM
  • 59. ENDSWITH, STARTSWITH • ENDSWITH - This function accepts two string parameters and verifies whether the first string ends with the second string. • STARTSWITH - This function accepts two string parameters. It verifies whether the first string starts with the second. • emp.txt
  • 60. • grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray); • grunt> emp_endswith = FOREACH emp_data GENERATE (id,name),ENDSWITH ( name, 'n' ); • grunt> Dump emp_endswith; • grunt> startswith_data = FOREACH emp_data GENERATE (id,name), STARTSWITH (name,'Ro'); • grunt> Dump startswith_data;
  • 61. SUBSTRING() • This function returns a substring from the given string. • EMP.TXT 001,Robin,22,newyork 002,Stacy,25,Bhuwaneshwar 003,Kelly,22,Chennai
  • 62. • grunt> emp_data = LOAD '/home/Cloudera/emp.txt' USING PigStorage(',')as (id:int, name:chararray, age:int, city:chararray); • grunt> substring_data = FOREACH emp_data GENERATE (id,name), SUBSTRING (name, 0, 3); • grunt> Dump substring_data; ((1,Robin),Rob) ((2,Stacy),Sta) ((3,Kelly),Kel)
  • 63. EqualsIgnoreCase() • The EqualsIgnoreCase() function is used to compare two strings and verify whether they are equal, ignoring differences in case. If both are equal this function returns the Boolean value true, else it returns the value false.
  • 64. • grunt> emp_data = LOAD '/home/Cloudera/emp.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray); • grunt> equals_data = FOREACH emp_data GENERATE (id,name), EqualsIgnoreCase(name, 'Robin'); • grunt> Dump equals_data; ((1,Robin),true) ((2,BOB),false) ((3,Maya),false) ((4,Sara),false) ((5,David),false) ((6,Maggy),false) ((7,Robert),false) ((8,Syam),false) ((9,Mary),false) ((10,Saran),false)
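SUBSTRING takes a start index and a stop index, where the start is inclusive and the stop is the index of the character after the last one returned. Python slicing follows the same convention, so a slice is a convenient way to check what a given pair of indexes yields. This is a sketch of the indexing rule only; the helper name `substring` is hypothetical.

```python
def substring(s, start, stop):
    """Model of SUBSTRING(s, start, stop): start inclusive, stop exclusive."""
    return s[start:stop]

names = ["Robin", "Stacy", "Kelly"]

# Three characters require a stop index of 3, not 2.
print([substring(n, 0, 3) for n in names])
print([substring(n, 0, 2) for n in names])
```

A stop index of 2 returns only the first two characters, so an expected result such as Rob needs the indexes 0 and 3.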
  • 65. UPPER(), LOWER() • UPPER- This function is used to convert all the characters in a string to uppercase. • LOWER- This function is used to convert all the characters in a string to lowercase.
  • 66. • grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray); • grunt> upper_data = FOREACH emp_data GENERATE (id,name), UPPER(name); • grunt> Dump upper_data; • grunt> lower_data = FOREACH emp_data GENERATE (id,name), LOWER(name); • grunt> Dump lower_data;
  • 67. REPLACE() • This function replaces every occurrence of a given pattern (a regular expression) in a string with a new string.
  • 68. • grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray); • grunt> replace_data = FOREACH emp_data GENERATE (id,city),REPLACE(city,'Bhuwaneshwar','Bhuw'); • grunt> Dump replace_data; ((1,newyork),newyork) ((2,Kolkata),Kolkata) ((3,Tokyo),Tokyo) ((4,London),London) ((5,Bhuwaneshwar),Bhuw) ((6,Chennai),Chennai) ((7,newyork),newyork) ((8,Kolkata),Kolkata) ((9,Tokyo),Tokyo) ((10,London),London) ((11,Bhuwaneshwar),Bhuw) ((12,Chennai),Chennai)
  • 69. TRIM(), RTRIM(), LTRIM() • The TRIM() function accepts a string and returns a copy of it with the unwanted spaces before and after it removed. • The function LTRIM() is similar to TRIM(), but it removes the unwanted spaces only from the left side of the given string (leading spaces). • The function RTRIM() is similar to TRIM(), but it removes the unwanted spaces only from the right side of the given string (trailing spaces).
  • 70. • grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray); • grunt> trim_data = FOREACH emp_data GENERATE (id,name), TRIM(name); • grunt> ltrim_data = FOREACH emp_data GENERATE (id,name), LTRIM(name); • grunt> rtrim_data = FOREACH emp_data GENERATE (id,name), RTRIM(name); • grunt> Dump trim_data; • grunt> Dump ltrim_data; • grunt> Dump rtrim_data;
  • 72. ToDate() • This function is used to generate a DateTime object according to the given parameters. • date.txt 001,1989/09/26 09:00:00 002,1980/06/20 10:22:00 003,1990/12/19 03:11:44
  • 73. • grunt> date_data = LOAD '/home/cloudera/date.txt' USING PigStorage(',') as (id:int,date:chararray); • grunt> todate_data = foreach date_data generate ToDate(date,'yyyy/MM/dd HH:mm:ss') as (date_time:DateTime); • grunt> Dump todate_data; (1989-09-26T09:00:00.000+05:30) (1980-06-20T10:22:00.000+05:30)
  • 74. GetDay() • This function accepts a date-time object as a parameter and returns the day of the month of the given date-time object. • date.txt 001,1989/09/26 09:00:00 002,1980/06/20 10:22:00 003,1990/12/19 03:11:44
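The second argument of REPLACE is treated as a regular expression, and every match is replaced. Python's `re.sub` behaves similarly for simple literal patterns like the one above, so it can serve as a rough model; the regex dialects of Java and Python differ in details, and the short city list here is a hypothetical subset of the data.

```python
import re

# A few city values from the dump above.
cities = ["newyork", "Kolkata", "Bhuwaneshwar", "Chennai"]

# Model of: REPLACE(city, 'Bhuwaneshwar', 'Bhuw')
replace_data = [(city, re.sub("Bhuwaneshwar", "Bhuw", city)) for city in cities]

for original, replaced in replace_data:
    print(original, replaced)
```

Values that do not contain the pattern pass through unchanged, which matches the unchanged rows in the dump.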
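The ToDate and GetDay steps can be checked against the sample rows with Python's `datetime`. This is a plain-Python sketch of the parsing logic, not Pig code: the Pig pattern 'yyyy/MM/dd HH:mm:ss' corresponds to the strptime format used below, and the resulting values carry no time zone, unlike the Pig output.

```python
from datetime import datetime

# Rows from date.txt as shown above: (id, date string)
date_data = [
    (1, "1989/09/26 09:00:00"),
    (2, "1980/06/20 10:22:00"),
    (3, "1990/12/19 03:11:44"),
]

# Model of ToDate(date, 'yyyy/MM/dd HH:mm:ss')
todate_data = [datetime.strptime(text, "%Y/%m/%d %H:%M:%S") for _, text in date_data]

# Model of GetDay(date_time): the day of the month
getday_data = [dt.day for dt in todate_data]
print(getday_data)
```

A string that does not fit the pattern raises ValueError here, which is a useful reminder to make sure the format argument matches the data exactly.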
  • 76. User Defined Functions • Apache Pig provides extensive support for User Defined Functions (UDFs). • Using these UDFs, we can define our own functions and use them. • UDF support is provided in six programming languages: Java, Jython, Python, JavaScript, Ruby, and Groovy.
  • 77. Creating UDF’S • Open Eclipse and create a new project. • Convert the newly created project into a Maven project. • Copy the pom.xml. This file contains the Maven dependencies for Apache Pig and Hadoop-core jar files.
  • 78. Java code import java.io.IOException; import org.apache.pig.EvalFunc; import org.apache.pig.data.Tuple; public class Sample_Eval extends EvalFunc<String>{ public String exec(Tuple input) throws IOException { if (input == null || input.size() == 0)
  • 79. Registering the Jar file • grunt> REGISTER '/home/cloudera/sample_udf.jar'; • grunt> DEFINE sample_eval Sample_Eval(); • grunt> emp_data = LOAD '/home/cloudera/pigdata.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray); • grunt> Upper_case = FOREACH emp_data GENERATE sample_eval(name);