Apache Pig Relational Operators - II

PIG
Relational Operators - II
Foreach, Filter, Join, Co-Group, Union

Relational operator: foreach
• FOREACH, as the name suggests, does something for each record. It is similar to a for loop: the GENERATE expression is evaluated once per record.
• Example: select a few columns
grunt> a = FOREACH dataTransaction GENERATE $0, $1, $2;
It can also be used for arithmetic operations, such as
grunt> A = FOREACH dataTransaction GENERATE $0, ($3+$4) AS S;
or
grunt> a = FOREACH dataTransaction GENERATE $0, (TransAmt1+TransAmt2) AS S;
Rupak Roy

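grunt> B = FOREACH A GENERATE $1/100;
or
grunt> B = FOREACH A GENERATE ($1/100) AS D;
A conditional (bincond) expression can then label each value:
grunt> C = FOREACH B GENERATE ((D > 50) ? 'above' : 'below');
or, nested to handle three cases:
grunt> C = FOREACH B GENERATE ((D == 50) ? 'Equal' : ((D > 50) ? 'above' : 'down'));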
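The nested conditional above can be hard to read at a glance. As a rough illustration only, here is the same decision written in Python rather than Pig; the sample rows and the name label are hypothetical and do not come from the dataset.

```python
def label(d):
    # Mirrors (D == 50) ? 'Equal' : ((D > 50) ? 'above' : 'down')
    return "Equal" if d == 50 else ("above" if d > 50 else "down")

# Hypothetical (id, S) rows standing in for relation A
A = [("C1", 2500), ("C2", 5000), ("C3", 9900)]

# B = FOREACH A GENERATE ($1/100) AS D;
B = [(s / 100,) for _, s in A]

# C = FOREACH B GENERATE <nested conditional>;
C = [label(d) for (d,) in B]
print(C)
```

The inner conditional is only evaluated when the outer test (D == 50) is false, which is why the three cases never overlap.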

Relational Operators: filter
• FILTER selects the tuples that satisfy a condition.
• Put simply, it removes unwanted records based on requirements.
Examples:
grunt> F = FILTER dataTransaction BY TransAmt1 > 500;
or
grunt> F1 = FILTER dataTransaction BY (($4+$5)/100) > 2;
or
grunt> F2 = FILTER dataTransaction BY $6 == 'Nunavut';
or
grunt> F3 = FILTER dataTransaction BY $1 MATCHES 'Car.*';
-- returns all records whose name starts with 'Car'
or
grunt> F4 = FILTER dataTransaction BY NOT ($1 MATCHES 'Car.*');
-- returns all records whose name does not start with 'Car'

Relational Operators: filter (continued)
grunt> F5 = FILTER dataTransaction BY CustomerName MATCHES 'Ca.*s';
-- keeps the records whose name starts with 'Ca' and ends with 's'. In the pattern, '.' matches any single character and '*' repeats it zero or more times, so '.*' stands for any run of characters between 'Ca' and 's'.
or
grunt> F6 = FILTER dataTransaction BY CustomerName MATCHES '.*(nica|los).*';
-- the leading and trailing '.*' allow any characters before and after 'nica' or 'los', so the name only has to contain one of them:
nica = Monica Federle
los = Carlos Daly
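MATCHES uses Java regular expressions and succeeds only when the whole string matches the pattern, which is why a contains-style test needs '.*' on both sides. A minimal sketch of that behaviour in Python (not Pig), using re.fullmatch as a stand-in; the helper name matches and the sample names are assumptions for illustration.

```python
import re

def matches(value, pattern):
    # Whole-string match, like Pig's MATCHES (a partial hit is not enough)
    return re.fullmatch(pattern, value) is not None

# Hypothetical sample of customer names
names = ["Carlos Daly", "Monica Federle", "Carl Ludwig", "Cari Sayres", "Oscar Wells"]

starts_with_car = [n for n in names if matches(n, r"Car.*")]
ca_to_s = [n for n in names if matches(n, r"Ca.*s")]
nica_or_los = [n for n in names if matches(n, r".*(nica|los).*")]
bare_los = [n for n in names if matches(n, r"los")]  # no '.*', so nothing matches

print(starts_with_car)
print(ca_to_s)
print(nica_or_los)
print(bare_los)
```

Note that "Oscar Wells" contains "car" but is not selected by 'Car.*', because the match is anchored at the start of the string and is case-sensitive.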

Relational operators: Join
• The JOIN operator combines two or more datasets.
• The datasets are joined on a common key.
• Joins can be of 3 types:
1. Self-join
2. Inner-join
3. Outer-join: left join, right join and full join

Self-join
• A self-join joins a table with itself.
Let's understand this with the help of an example:
-- Load the same dataset under two different aliases:
grunt> join1 = LOAD '/home/hduser/datasets/join1.csv' USING PigStorage(',') AS (CustomerNAme:chararray, Transaction_ID:bytearray, ProductName:chararray);
grunt> join11 = LOAD '/home/hduser/datasets/join1.csv' USING PigStorage(',') AS (CustomerNAme:chararray, Transaction_ID:bytearray, ProductName:chararray);

-- Perform the self-join using the JOIN operator
grunt> selfjoin = JOIN join1 BY Transaction_ID, join11 BY Transaction_ID;
grunt> dump selfjoin;

Inner-join
• Also known as an equijoin.
• An inner join returns rows only when the common key matches in both tables.
-- Load the second dataset
grunt> join2 = LOAD '/home/hduser/datasets/join2.csv' USING PigStorage(',') AS (CustomerNAme:chararray, Transaction_ID:bytearray, Department:chararray);
grunt> innerjoin = JOIN join1 BY Transaction_ID, join2 BY Transaction_ID;
grunt> dump innerjoin;

Outer Join
• Left Outer Join: returns all rows from the left table, even when there is no match in the right table. Values from the right table appear only for matching rows; otherwise the right-hand fields are null.
grunt> leftouter = JOIN join1 BY Transaction_ID LEFT OUTER, join2 BY Transaction_ID;
• Right Outer Join: the opposite of Left Outer Join. It returns all rows from the right table, even when there is no match in the left table. Values from the left table appear only for matching rows; otherwise the left-hand fields are null.
grunt> rightouter = JOIN join1 BY Transaction_ID RIGHT OUTER, join2 BY Transaction_ID;

Outer Join (continued)
• Full Outer Join: returns all rows from both tables. Rows that match on the key are combined; rows without a match in the other relation are still kept, with the missing side filled with nulls.
grunt> fullouter = JOIN join1 BY Transaction_ID FULL OUTER, join2 BY Transaction_ID;
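To compare the join types side by side, here is a small model in Python (not Pig) of which rows each one keeps. It is a sketch under assumptions: the function join, the two-field tuples and the sample keys T1 to T4 are hypothetical, None stands in for Pig's null, and the sample deliberately contains no null keys.

```python
def join(left, right, how="inner"):
    """Join two lists of (key, value) tuples on key; None plays the role of null."""
    out = []
    matched_right = set()
    for lk, lv in left:
        hits = [(i, rv) for i, (rk, rv) in enumerate(right) if rk == lk]
        for i, rv in hits:
            matched_right.add(i)
            out.append((lk, lv, lk, rv))
        # Unmatched left rows survive only in LEFT and FULL outer joins
        if not hits and how in ("left", "full"):
            out.append((lk, lv, None, None))
    # Unmatched right rows survive only in RIGHT and FULL outer joins
    if how in ("right", "full"):
        for i, (rk, rv) in enumerate(right):
            if i not in matched_right:
                out.append((None, None, rk, rv))
    return out

# Hypothetical (Transaction_ID, value) rows
join1 = [("T1", "Pen"), ("T2", "Desk"), ("T3", "Lamp")]
join2 = [("T2", "Office"), ("T3", "Home"), ("T4", "Garden")]

for how in ("inner", "left", "right", "full"):
    print(how, join(join1, join2, how))
```

With this sample, T1 exists only on the left and T4 only on the right, so the inner join returns two rows, the left and right outer joins three each, and the full outer join four.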

Joins are among the most important operators.

Co-Group
COGROUP essentially performs a join and a group at the same time.
COGROUP on multiple datasets produces one record per key: the key itself, followed by one bag per dataset holding that dataset's tuples for the key.
To perform COGROUP, type:
grunt> cg = COGROUP join1 BY Transaction_ID, join2 BY Transaction_ID;
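The shape of a COGROUP result is easier to see on a tiny example. The following Python sketch (not Pig) models it with lists standing in for bags; the function cogroup and the sample rows are hypothetical, and the keys are sorted here only to make the result repeatable.

```python
def cogroup(left, right):
    """One record per key: (key, tuples from left, tuples from right)."""
    keys = sorted({t[0] for t in left} | {t[0] for t in right})
    return [
        (k, [t for t in left if t[0] == k], [t for t in right if t[0] == k])
        for k in keys
    ]

# Hypothetical (Transaction_ID, value) rows
join1 = [("T1", "Pen"), ("T2", "Desk"), ("T2", "Chair")]
join2 = [("T2", "Office"), ("T3", "Home")]

cg = cogroup(join1, join2)
for record in cg:
    print(record)
```

Unlike a join, a key that appears in only one dataset still produces a record, with an empty bag on the other side, and multiple tuples for the same key stay grouped instead of being multiplied out.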

Relational Operator: UNION
• UNION merges the contents of two or more datasets.
grunt> U = UNION join1, join2;
grunt> dump U;
What if we want to merge two datasets that have different schemas? For example:
grunt> join1 = LOAD '/home/hduser/datasets/join1.csv' USING PigStorage(',') AS (CustomerNAme:chararray, Transaction_ID:chararray, Department:chararray);
grunt> join1u = LOAD '/home/hduser/datasets/join1.csv' USING PigStorage(',') AS (CustomerNAme:chararray, Transaction_ID:int, Department:chararray);
grunt> join2 = LOAD '/home/hduser/datasets/join2.csv' USING PigStorage(',') AS (CustomerNAme:chararray, Transaction_ID:chararray, Department:chararray);
grunt> Unioned = UNION join1u, join2;
grunt> describe Unioned;
-- this throws an error, 'cannot cast to bytearray', because Transaction_ID has a different data type in the two relations.

• Going back and reloading the data just to change the schema is tedious and time consuming. Instead, we can cast fields explicitly inside a relational query, without disturbing the original schema.
grunt> joinM = FOREACH join2 GENERATE $0, (int)$1, $2;
grunt> unioned = UNION joinM, join1u;
grunt> describe unioned;
Alternatively, UNION ONSCHEMA bases the union on field names rather than positions (every input must have a declared schema):
grunt> U = UNION ONSCHEMA join1u, join2;

Relational Operator: RANK
• RANK assigns a rank to each tuple within a relation.
Example:
$ vi names
Zara,1,F
David,2,F
David,2,T
Alan,2,M
Calvin,3,M
Alan,5,M
Chris,8,M
Ellie,7,F
Bob,8,M
Carlos,2,M
Then press the 'ESC' key and type ':wq!' to save.
grunt> names = LOAD '/home/hduser/datasets/names' USING PigStorage(',') AS (n1:chararray, n2:int, n3:chararray);
grunt> DUMP names;

grunt> ranked = RANK names;
grunt> dump ranked;
(1,Zara,1,F)
(2,David,2,F)
(3,David,2,T)
(4,Alan,2,M)
(5,Calvin,3,M)
(6,Alan,5,M)
(7,Chris,8,M)
(8,Ellie,7,F)
(9,Bob,8,M)
(10,Carlos,2,M)
We can also rank by two fields, each with its own sort order.
grunt> ranked2 = RANK names BY n1 ASC, n2 DESC;
grunt> dump ranked2;

• With RANK ... BY, tuples that tie on the sort fields receive the same rank, and the ranks that follow skip ahead, leaving gaps in the numbering.
• To avoid the gaps, add the DENSE keyword.
grunt> rankedG = RANK names BY n1 DESC, n2 ASC DENSE;
grunt> dump rankedG;
(1,Zara,1,F)
(2,Ellie,7,F)
(3,David,2,F)
(3,David,2,T)
(4,Chris,8,M)
(5,Carlos,2,M)
(6,Calvin,3,M)
(7,Bob,8,M)
(8,Alan,2,M)
(9,Alan,5,M)
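The difference DENSE makes shows up only when there are ties. Here is a minimal Python sketch (not Pig) of the two numbering schemes, ranking by a single numeric field for simplicity; the function rank and the shortened sample rows are assumptions made for illustration.

```python
def rank(rows, key, dense=False):
    """Prefix each row with its rank; tied rows share a rank."""
    ordered = sorted(rows, key=key)
    out = []
    prev = object()  # sentinel that never equals a real key
    current = 0
    for position, row in enumerate(ordered, start=1):
        k = key(row)
        if k != prev:
            # dense: next consecutive number; otherwise: the row's position, which leaves gaps
            current = current + 1 if dense else position
            prev = k
        out.append((current,) + row)
    return out

# Hypothetical (n1, n2) rows, ranked by n2 ascending
rows = [("Zara", 1), ("David", 2), ("Alan", 2), ("Calvin", 3), ("Chris", 8)]

standard = rank(rows, key=lambda r: r[1])
dense = rank(rows, key=lambda r: r[1], dense=True)
print(standard)
print(dense)
```

In this sample David and Alan tie on the value 2. Without dense ranking the next row gets rank 4, because rank 3 is skipped; with dense ranking it gets rank 3.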

Next
• We will learn about UDFs (User Defined Functions).