Apache Pig
Making data transformation easy
Víctor Sánchez Anguix
Universitat Politècnica de València
MSc. in Artificial Intelligence, Pattern Recognition, and Digital Image
Course 2014/2015
Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce Problem Solving
Complex problem
➢ Need to solve a complex problem
➢ Need atomic operations more complex than M/R
➢ Java is not a data-oriented language → low productivity
➢ Any solutions?
Apache Pig to the rescue!
Join in Apache Hadoop
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class DeliveryFileMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    private String cellNumber, deliveryCode, fileTag = "DR~";

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // Each input line is "cellNumber,deliveryCode"
        String line = value.toString();
        String[] splitarray = line.split(",");
        cellNumber = splitarray[0].trim();
        deliveryCode = splitarray[1].trim();
        // Tag the value so that the reducer knows which file it came from
        output.collect(new Text(cellNumber), new Text(fileTag + deliveryCode));
    }
}

public class SmsReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    private String customerName, deliveryReport;
    private static Map<String, String> DeliveryCodesMap = new HashMap<String, String>();

    public void configure(JobConf job) {
        // Helper that fills DeliveryCodesMap; its body is not shown on the slide
        loadDeliveryStatusCodes();
    }

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        while (values.hasNext()) {
            String currValue = values.next().toString();
            String[] valueSplitted = currValue.split("~");
            // The tag before "~" tells which input file the value came from
            if (valueSplitted[0].equals("CD"))
                customerName = valueSplitted[1].trim();
            else if (valueSplitted[0].equals("DR"))
                deliveryReport = DeliveryCodesMap.get(valueSplitted[1].trim());
        }
        if (customerName != null && deliveryReport != null)
            output.collect(new Text(customerName), new Text(deliveryReport));
        else if (customerName == null)
            output.collect(new Text("customerName"), new Text(deliveryReport));
        else if (deliveryReport == null)
            output.collect(new Text(customerName), new Text("deliveryReport"));
    }
}
** Extracted from http://kickstarthadoop.blogspot.com.es/2011/09/joins-with-plain-map-reduce.html
Join in Apache Pig
A = JOIN A BY keyA, B BY keyB;
Apache Pig overview
➢ Framework layer over HDFS and Hadoop
➢ Developed at Yahoo in 2006
➢ Users: Yahoo, LinkedIn, Twitter, IBM, etc.
➢ Last major release: 0.14.0 (November 2014)
http://pig.apache.org/
Apache Hadoop vs. Apache Pig
Apache Hadoop:
➢ M/R as atomic operations
➢ Java is not data oriented
➢ M/R inner flexibility
➢ Efficiency
Apache Pig:
➢ ETL operations: Join, Filter, Group, etc.
➢ Pig Latin: data scripting language
➢ UDFs in Java (and other languages)
➢ Overhead of translating to M/R
Pig Programming Model: Data
➢ Pig operations operate on relations
➢ A relation is a bag
➢ A bag is a collection of tuples
➢ A tuple is an ordered set of fields
➢ A field is any type of data
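The nesting described above can be pictured with ordinary Python containers. This is only a sketch of the shapes involved, not Pig itself: tuples stand in for Pig tuples, lists stand in for bags, and the sample values are illustrative.

```python
# Plain-Python stand-ins for Pig's data model (illustration only, not Pig).
# A field is a single value; a tuple is an ordered sequence of fields.
john = (1, "John", "Doe", "M", 18)
mary = (2, "Mary", "Doe", "F", 20)

# A bag is a collection of tuples (duplicates are allowed).
students = [john, mary]

# Fields can themselves be tuples or bags, so the structures can nest.
# This mirrors the shape of a GROUP result: (key, bag of matching tuples).
grouped = ("M", [john])

# A relation is simply the outermost bag that the operators work on.
relation = students

print(len(relation))      # number of tuples in the relation
print(relation[0][1])     # second field of the first tuple
```

In Pig the outermost bag is distributed across the cluster, which a Python list obviously is not; only the nesting is the same.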
Sounds complicated… but it’s not!
➢ Basic data types:
○ Boolean: True, False
○ Int and Long: 1, 2, 3, 4, 5
○ Float and Double: 2.3, 1.4, 4.5
○ Chararray: ‘Hello’, ‘I am a string’
○ DateTime: 2014-09-11T12:20:14.1234+00:00
○ … and more, but you probably won’t use them very often
➢ Tuple: A catch-all data type
➢ Bag: a collection of tuples
➢ And relations? Just the outermost (distributed) bags
Loading data?
➢ Loading data? No, first let’s meet our friend Grunt
➢ Interactive Pig shell → nice for debugging/experimenting
➢ Started with pig -x local or pig -x mapred
Loading data?
➢ Data source: Local or HDFS (usually!)
➢ LOAD instruction:
○ Data is automatically loaded into a distributed relation
Students = LOAD 'student_path' USING PigStorage('\t', '-noschema') AS
    (student_id: Long, name: Chararray, surname: Chararray, gender: Chararray,
    age: Int);
Parts of the statement: relation name (Students), path on local disk/HDFS ('student_path'), connector (PigStorage), field separator ('\t'), tuple schema (the AS clause)
➢ The schema can also be read from a file:
Grades = LOAD 'grade_path' USING PigStorage(',', '-schema');
Parts of the statement: relation name (Grades), path on local disk/HDFS ('grade_path'), connector (PigStorage), field separator (','), and the '-schema' option, which loads the schema from .pig_schema
Checking relations’ content
➢ DUMP instruction:
○ Prints the content of a relation to standard output
DUMP Students;
(1,John,Doe,M,18)
(2,Mary,Doe,F,20)
(3,Lara,Croft,F,25)
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(6,Sarah,Kerrigan,F,21)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
(9,Princess,Peach,F,21)
(10,Peter,Parker,M,23)
grunt>
Checking relations’ content
➢ DESCRIBE instruction:
○ Prints the schema of the relation to standard output
DESCRIBE Students;
Students: {student_id: long,name: chararray,surname: chararray,gender: chararray,age: int}
grunt>
Checking relations’ content
➢ ILLUSTRATE instruction:
○ Prints the schema of the relation and an example tuple to standard output
ILLUSTRATE Students;
---------------------------------------------------------------------------------------------
| Students | student_id:long | name:chararray | surname:chararray | gender:chararray | age:int |
---------------------------------------------------------------------------------------------
|          | 9               | Princess       | Peach             | F                | 21      |
---------------------------------------------------------------------------------------------
grunt>
Operating on relations
➢ FOREACH instruction:
○ Generate new relations by projecting data of a relation
StudentsProj = FOREACH Students GENERATE student_id, name, age;
Parts of the statement: new relation name (StudentsProj), base relation (Students), projected data (the GENERATE list)
➢ We can generate new data too:
StudentsProj = FOREACH Students GENERATE student_id, CONCAT(name, surname) AS full_name, age;
Operating on relations
➢ FOREACH instruction:
○ Let us execute the instruction and… it seems that
nothing happens!
○ We had some tracing output with LOAD, DUMP, and
ILLUSTRATE…
○ Any ideas on this issue?
Operating on relations
➢ Pig employs lazy evaluation
➢ Computation is triggered only by:
○ DUMP, ILLUSTRATE, STORE (LOAD alone only declares a relation)
➢ Pig keeps a DAG of the MR jobs needed to compute relations (optimized!)
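The idea of lazy evaluation can be sketched in plain Python. The `Relation` class, its methods, and `load_students` below are hypothetical names made up for this illustration; they are not part of Pig. Each operation only records a step in a plan, and the data source is read only when output is requested.

```python
class Relation:
    """Hypothetical stand-in for a Pig alias: it records a plan, not data."""

    def __init__(self, source, steps=()):
        self.source = source          # callable that produces the rows
        self.steps = tuple(steps)     # transformations recorded so far

    def foreach(self, fn):
        # Only extends the plan; no row is touched here.
        return Relation(self.source, self.steps + (("map", fn),))

    def filter(self, pred):
        # Same: the predicate is stored, not applied.
        return Relation(self.source, self.steps + (("filter", pred),))

    def dump(self):
        # Execution happens only now, when output is requested.
        rows = list(self.source())
        for kind, fn in self.steps:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows


reads = []  # records every time the data source is actually read


def load_students():
    reads.append("read")
    return [(1, "John", 18), (3, "Lara", 25), (7, "Bruce", 32)]


students = Relation(load_students)
proj = students.foreach(lambda r: (r[0], r[2]))
filt = proj.filter(lambda r: r[1] > 24)
reads_before_dump = len(reads)   # nothing has been read so far
result = filt.dump()             # triggers the whole chain
print(reads_before_dump, result)
```

Because the full plan is known before anything runs, a real engine such as Pig can reorder or merge steps; this sketch simply replays them in order.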
Exercise: Who is under 25?
➢ Extend the Students relation with a field that indicates whether each student is under 25 years old
(1,John,Doe,M,18,true)
(2,Mary,Doe,F,20,true)
(3,Lara,Croft,F,25,false)
(4,Sherlock,Holmes,M,36,false)
(5,John,Watson,M,38,false)
(6,Sarah,Kerrigan,F,21,true)
...
Operating on relations
➢ FILTER instruction:
○ Generate a new relation by filtering data on a relation
StudentsFilt= FILTER Students BY age > 24 AND age < 34;
DUMP StudentsFilt;
(3,Lara,Croft,F,25)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
Parts of the statement: new relation name (StudentsFilt), base relation (Students), condition to fulfill (age > 24 AND age < 34)
Operating on relations
➢ SPLIT instruction:
○ Splits a relation into multiple relations based on
conditions
SPLIT Students INTO StudentsMale IF gender == 'M', StudentsFemale OTHERWISE;
DUMP StudentsMale;
(1,John,Doe,M,18)
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
(10,Peter,Parker,M,23)
Parts of the statement: base relation (Students); new relation (StudentsMale) and the condition its tuples must fulfill; new relation (StudentsFemale), where OTHERWISE means the rest
SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder30 IF age<30, OtherStudents OTHERWISE;
DUMP OtherStudents;
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
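One detail of SPLIT is easy to miss: the conditions are not exclusive, so a tuple that satisfies several of them lands in several output relations, and OTHERWISE collects only the tuples that matched none. The plain-Python sketch below illustrates this; the `split` helper is a hypothetical name and the five rows are a subset of the Students data shown earlier.

```python
students = [
    (1, "John", "Doe", "M", 18),
    (2, "Mary", "Doe", "F", 20),
    (3, "Lara", "Croft", "F", 25),
    (4, "Sherlock", "Holmes", "M", 36),
    (7, "Bruce", "Wayne", "M", 32),
]


def split(rows, conditions):
    """Return one list per condition plus a final 'otherwise' list."""
    outputs = [[] for _ in conditions]
    otherwise = []
    for row in rows:
        matched = False
        for out, cond in zip(outputs, conditions):
            if cond(row):          # every matching branch receives the row
                out.append(row)
                matched = True
        if not matched:            # OTHERWISE: rows that matched no condition
            otherwise.append(row)
    return outputs + [otherwise]


under25, under30, others = split(
    students,
    [lambda r: r[4] < 25, lambda r: r[4] < 30],
)
print([r[0] for r in under25], [r[0] for r in under30], [r[0] for r in others])
```

Students 1 and 2 end up in both `under25` and `under30`, which is why the two relations overlap.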
Operating on relations
➢ GROUP instruction:
○ Creates one tuple per key value, holding the key and a bag with all the tuples that share it
StudentsGr = GROUP Students BY gender;
DUMP StudentsGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(3,Lara,Croft,F,25),(2,Mary,Doe,F,20)})
(M,{(10,Peter,Parker,M,23),(8,Tony,Stark,M,33),(7,Bruce,Wayne,M,32),(5,John,Watson,M,38),(4,Sherlock,Holmes,M,36),(1,John,Doe,M,18)})
DESCRIBE StudentsGr;
StudentsGr: {group: chararray,Students: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)}}
Parts of the statement: new relation (StudentsGr), base relation (Students), field whose values define the groups (gender). Note the new schema!
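The shape of a GROUP result, one (key, bag) tuple per distinct key, can be reproduced with a dictionary of lists. This is a plain-Python sketch rather than Pig; `group_by` is a hypothetical helper and the rows are a subset of the Students data.

```python
from collections import defaultdict

students = [
    (1, "John", "Doe", "M", 18),
    (2, "Mary", "Doe", "F", 20),
    (3, "Lara", "Croft", "F", 25),
    (4, "Sherlock", "Holmes", "M", 36),
]


def group_by(rows, key):
    """Emulate GROUP: one (key, bag) tuple per distinct key value."""
    bags = defaultdict(list)
    for row in rows:
        bags[key(row)].append(row)   # the whole tuple goes into the bag
    return sorted(bags.items())      # sorted only to make the output stable


students_gr = group_by(students, key=lambda r: r[3])
for gender, bag in students_gr:
    print(gender, bag)
```

Note that the grouping field is still present inside every tuple of the bag, just as `gender` is in the DUMP output above.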
Operating on relations
➢ GROUP instruction:
○ We can group several relations at once; the result has one bag per relation
StudentsGr = GROUP StudentsUnder25 BY gender, OtherStudents BY gender;
DUMP StudentsGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(8,Tony,Stark,M,33),(7,Bruce,Wayne,M,32),(5,John,Watson,M,38),(4,Sherlock,Holmes,M,36)})
DESCRIBE StudentsGr;
StudentsGr: {group: chararray,StudentsUnder25: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)},OtherStudents: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)}}
Parts of the statement: new relation (StudentsGr), base relations (StudentsUnder25 and OtherStudents), field whose values define the groups (gender, for each relation). Note the new schema!
Operating on relations
➢ Nested FOREACH:
○ Operate on data in bags inside a relation and then project
StudentsNested = FOREACH StudentsGr {
    Information = FOREACH Students GENERATE name, surname;
    GENERATE group AS gender, Information AS student_information;
}
DUMP StudentsNested;
(F,{(Princess,Peach),(Sarah,Kerrigan),(Lara,Croft),(Mary,Doe)})
(M,{(Peter,Parker),(Tony,Stark),(Bruce,Wayne),(John,Watson),(Sherlock,Holmes),(John,Doe)})
Parts of the statement: new relation (StudentsNested), base relation (StudentsGr), bag inside the base relation (Students), new bag (Information), final projection (GENERATE)
Operating on relations
➢ (inner) JOIN instruction:
○ Our classic database operator for relations!
StudentsGrades = JOIN Students BY student_id, Grades BY student_id;
DUMP StudentsGrades;
(1,John,Doe,M,18,1,Physics,2.3)
(1,John,Doe,M,18,1,Biology,4.5)
(1,John,Doe,M,18,1,Engineering,7.7)
(1,John,Doe,M,18,1,Math,5.6)
(2,Mary,Doe,F,20,2,Engineering,6.7)
(2,Mary,Doe,F,20,2,Physics,6.7)
…
DESCRIBE StudentsGrades;
StudentsGrades: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}
Parts of the statement: new relation (StudentsGrades), base relations (Students and Grades), fields whose values are matched (student_id on both sides). Note the new schema!
Operating on relations
➢ (left) JOIN instruction:
○ Our classic database operator for relations!
StudentsGrades = JOIN Students BY student_id LEFT, Grades BY student_id;
DUMP StudentsGrades;
(6,Sarah,Kerrigan,F,21,,,)
(7,Bruce,Wayne,M,32,7,Engineering,8.5)
(7,Bruce,Wayne,M,32,7,Physics,8.9)
(7,Bruce,Wayne,M,32,7,Math,8.5)
(8,Tony,Stark,M,33,8,Math,6.7)
…
DESCRIBE StudentsGrades;
StudentsGrades: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}
Parts of the statement: new relation (StudentsGrades), left relation (Students), right relation (Grades), and the LEFT keyword (do not forget this one!). Note the new schema!
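The difference between the inner and the left join can be seen in a small plain-Python hash join. This is a sketch of the semantics only, not of how Pig executes joins; `join` and its parameters are hypothetical names, and the rows are shortened versions of the Students and Grades data.

```python
from collections import defaultdict

students = [(6, "Sarah", "Kerrigan"), (7, "Bruce", "Wayne"), (8, "Tony", "Stark")]
grades = [(7, "Engineering", 8.5), (7, "Physics", 8.9), (8, "Math", 6.7)]


def join(left, right, left_key, right_key, right_width, keep_unmatched=False):
    """Hash join: index the right side, then probe it with each left tuple."""
    index = defaultdict(list)
    for r in right:
        index[right_key(r)].append(r)
    out = []
    for l in left:
        matches = index.get(left_key(l), [])
        for r in matches:
            out.append(l + r)                      # concatenate both tuples
        if not matches and keep_unmatched:
            out.append(l + (None,) * right_width)  # LEFT: pad with nulls
    return out


inner = join(students, grades, lambda s: s[0], lambda g: g[0], 3)
left = join(students, grades, lambda s: s[0], lambda g: g[0], 3,
            keep_unmatched=True)
print(len(inner), len(left))
```

Sarah Kerrigan has no grades, so she disappears from the inner join but survives the left join with empty grade fields, matching the `(6,Sarah,Kerrigan,F,21,,,)` row above.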
Operating on relations
➢ CROSS instruction:
○ Cartesian product of two or more relations
StudentsCr = CROSS Students, Grades;
DUMP StudentsCr;
(10,Peter,Parker,M,23,10,Physics,3.3)
(10,Peter,Parker,M,23,9,Physics,5.0)
(10,Peter,Parker,M,23,7,Physics,8.9)
(10,Peter,Parker,M,23,5,Physics,4.5)
(10,Peter,Parker,M,23,4,Physics,6.6)
(10,Peter,Parker,M,23,3,Physics,5.7)
(10,Peter,Parker,M,23,2,Physics,6.7)
(10,Peter,Parker,M,23,1,Physics,2.3)
…
DESCRIBE StudentsCr;
StudentsCr: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}
Parts of the statement: new relation (StudentsCr), relation 1 (Students), relation 2 (Grades). Note the new schema!
Operating on relations
➢ UNION instruction:
○ Merges multiple relations into a single relation
StudentsUnion = UNION Students, Grades;
DUMP StudentsUnion;
(1,John,Doe,M,18)
(1,Math,5.6)
(2,Mary,Doe,F,20)
(2,Math,8.9)
(3,Lara,Croft,F,25)
(3,Math,7.1)
…
DESCRIBE StudentsUnion;
Schema for StudentsUnion unknown.
Parts of the statement: new relation (StudentsUnion), relation 1 (Students), relation 2 (Grades). UNION does not preserve the schema when the input schemas differ!
Operating on relations
➢ DISTINCT instruction:
○ Only preserves unique tuples
Courses = FOREACH Grades GENERATE course AS course;
UniqueCourses = DISTINCT Courses;
DUMP UniqueCourses;
(Math)
(Biology)
(Physics)
(Engineering)
Operating on relations
➢ ORDER BY instruction:
○ Sorts a relation by the given criteria
SortedGrades = ORDER Grades BY mark DESC;
DUMP SortedGrades;
(2,Biology,10.0)
(10,Engineering,10.0)
(10,Math,10.0)
(5,Biology,10.0)
(5,Engineering,9.0)
(7,Physics,8.9)
…
Parts of the statement: new relation (SortedGrades), base relation (Grades), field(s) used to sort (mark), sort order: DESC (descending) or ASC (ascending)
Operating on relations
➢ LIMIT instruction:
○ Truncates the relation to a maximum number of tuples
BestGrades = LIMIT SortedGrades 3;
DUMP BestGrades;
(10,Math,10.0)
(10,Engineering,10.0)
(2,Biology,10.0)
Parts of the statement: new relation (BestGrades), base relation (SortedGrades), maximum number of tuples (3)
Operating on relations
➢ RANK instruction:
○ Prepends to each tuple its position in the relation
RankedGrades = RANK SortedGrades;
DUMP RankedGrades;
(1,2,Biology,10.0)
(2,10,Engineering,10.0)
(3,10,Math,10.0)
(4,5,Biology,10.0)
(5,5,Engineering,9.0)
…
DESCRIBE RankedGrades;
RankedGrades: {rank_SortedGrades: long,student_id: long,course: chararray,mark: double}
Parts of the statement: new relation (RankedGrades), base relation (SortedGrades). The first field of the result is the rank number!
Operating on relations
➢ RANK instruction:
○ We can also sort and rank!
RankedGrades = RANK SortedGrades BY student_id ASC, mark DESC;
DUMP RankedGrades;
(1,1,Engineering,7.7)
(2,1,Math,5.6)
(3,1,Biology,4.5)
(4,1,Physics,2.3)
(5,2,Biology,10.0)
…
DESCRIBE RankedGrades;
RankedGrades: {rank_SortedGrades: long,student_id: long,course: chararray,mark: double}
Parts of the statement: new relation (RankedGrades), base relation (SortedGrades), fields to sort by (student_id ASC, mark DESC)
Operating on relations
➢ SAMPLE instruction:
○ Takes a random sample of the relation
SampledGrades = SAMPLE Grades 0.05;
DUMP SampledGrades;
(4,Engineering,8.0)
Parts of the statement: new relation (SampledGrades), base relation (Grades), proportion to sample (0.05)
Exercise: Top grades
➢ Get the 3 top grades for each student
(1,{(Engineering,7.7),(Math,5.6),(Biology,4.5)})
(2,{(Biology,10.0),(Math,8.9),(Engineering,6.7)})
(3,{(Math,7.1),(Physics,5.7),(Engineering,4.3)})
(4,{(Engineering,8.0),(Biology,6.7),(Physics,6.6)})
(5,{(Biology,10.0),(Engineering,9.0),(Math,6.7)})
(6,{(,)})
...
Operating on relations
➢ CUBE instruction:
○ Is this really useful? Yes! Many aggregates with just one operation
CubedGrades = CUBE Grades BY CUBE(student_id,course);
CubedGrades = FOREACH CubedGrades GENERATE group, AVG(cube.mark);
DUMP CubedGrades;
((,Math),7.188888888888889)
((,Biology),7.8)
((,Physics),5.375)
((,Engineering),6.877777777777778)
((,),6.729032258064516)
((2,Math),8.9)
((2,Biology),10.0)
((2,),8.075)
…
Operating on relations
➢ CUBE/ROLLUP instruction:
○ Like the standard CUBE, but null values are introduced only from right to left, so the order of the fields matters!
RolledGrades = CUBE Grades BY ROLLUP(course,student_id);
RolledGrades = FOREACH RolledGrades GENERATE group, AVG(cube.mark);
DUMP RolledGrades;
((Math,),7.188888888888889)
((Math,2),8.9)
((Math,3),7.1)
((Math,4),2.3)
((Math,5),6.7)
((Math,7),8.5)
((Math,8),6.7)
((Math,9),8.9)
…
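What separates CUBE from ROLLUP is the set of dimension combinations that get aggregated: every subset for CUBE, only the left-to-right prefixes for ROLLUP. The plain-Python sketch below makes those grouping sets explicit. The helper names (`cube_sets`, `rollup_sets`, `aggregate`) and the three grade rows are made up for this illustration, and `None` plays the role of Pig's null.

```python
from collections import defaultdict
from itertools import combinations

grades = [(2, "Math", 8.9), (2, "Biology", 10.0), (3, "Math", 7.1)]


def cube_sets(n):
    """All subsets of the n dimension positions (2**n grouping sets)."""
    return [set(c) for k in range(n, -1, -1) for c in combinations(range(n), k)]


def rollup_sets(n):
    """Only prefixes: dimensions are dropped one at a time, right to left."""
    return [set(range(k)) for k in range(n, -1, -1)]


def aggregate(rows, dims, grouping_sets):
    """Average the mark for every grouping set; None stands for null."""
    sums = defaultdict(lambda: [0.0, 0])
    for row in rows:
        for kept in grouping_sets:
            key = tuple(d(row) if i in kept else None
                        for i, d in enumerate(dims))
            sums[key][0] += row[2]
            sums[key][1] += 1
    return {k: total / count for k, (total, count) in sums.items()}


by_student_course = [lambda r: r[0], lambda r: r[1]]
cubed = aggregate(grades, by_student_course, cube_sets(2))

by_course_student = [lambda r: r[1], lambda r: r[0]]
rolled = aggregate(grades, by_course_student, rollup_sets(2))
print(len(cubed), len(rolled))
```

With ROLLUP(course, student_id) there is a per-course average such as `("Math", None)` but no per-student average such as `(None, 2)`; swapping the two fields would give the opposite.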
Operating on relations
➢ ASSERT instruction:
○ Assert that the whole relation fulfills a condition
○ Useful for debugging
ASSERT Grades BY mark > 0.0, 'marks should be greater than 0';
Parts of the statement: base relation (Grades), condition (mark > 0.0), error message
Finally, storing data!
➢ STORE instruction:
○ Stores the relation into the local FS or HDFS (usually!)
○ Useful for debugging
STORE BestGrades INTO 'best_grades_path' USING PigStorage('\t', '-noschema');
Parts of the statement: relation (BestGrades), path to store the data ('best_grades_path'), connector (PigStorage), field separator ('\t')
Problems solved?!
Only these operations?
➢ ASSERT
➢ COGROUP
➢ CROSS
➢ CUBE
➢ DISTINCT
➢ FILTER
➢ FOREACH
➢ GROUP
➢ JOIN
➢ LIMIT
➢ LOAD
➢ ORDER, RANK
➢ SAMPLE
➢ SPLIT
➢ UNION
➢ DUMP, ILLUSTRATE, DESCRIBE
Functions & user defined functions
➢ Transform data in data projections
➢ Built-in functions:
○ math functions, string functions, datetime functions, casting functions, etc.
➢ User defined functions:
○ Our own functions written in Java, Python, Ruby, JavaScript, etc.
Functions & user defined functions
➢ Bag functions:
○ AVG/MAX/MIN/SUM: compute the average/max/min/sum of a bag of numeric values
GradesGr = GROUP Grades BY course;
GradesAvg = FOREACH GradesGr GENERATE group AS course, AVG(Grades.mark) AS avg_mark;
DUMP GradesAvg;
(Math,7.188888888888889)
(Biology,7.8)
(Physics,5.375000000000001)
(Engineering,6.877777777777777)
Note: Grades.mark selects only the mark field of the tuples in the bag
Functions & user defined functions
➢ Bag functions:
○ COUNT: number of non-null elements in a bag
GradesCount = FOREACH GradesGr GENERATE group AS course, COUNT(Grades) AS number_students;
DUMP GradesCount;
(Math,9)
(Biology,5)
(Physics,8)
(Engineering,9)
Functions & user defined functions
➢ Bag/Tuple functions:
○ FLATTEN: behavior depends on input. On a bag, it produces one tuple per element
DUMP GradesGr;
(Math,{(8,Math,6.7),(1,Math,5.6),(10,Math,10.0),(9,Math,8.9),(2,Math,8.9),(3,Math,7.1),(4,Math,2.3),(5,Math,6.7),(7,Math,8.5)})
(Biology,{(5,Biology,10.0),(4,Biology,6.7),(2,Biology,10.0),(1,Biology,4.5),(9,Biology,7.8)})
...
GradesFlat = FOREACH GradesGr GENERATE group AS course, FLATTEN(Grades.mark) AS mark;
DUMP GradesFlat;
(Math,6.7)
(Math,5.6)
(Math,10.0)
…
Functions & user defined functions
➢ Bag/Tuple functions:
○ FLATTEN: behavior depends on input. On a tuple, it promotes the tuple's fields to the enclosing tuple
GradesTuple = FOREACH Grades GENERATE student_id, TOTUPLE(course, mark) AS tuple_mark;
DUMP GradesTuple;
(1,(Math,5.6))
(2,(Math,8.9))
(3,(Math,7.1))
(4,(Math,2.3))
...
GradesUntupled = FOREACH GradesTuple GENERATE student_id AS student_id, FLATTEN(tuple_mark);
DUMP GradesUntupled;
(1,Math,5.6)
(2,Math,8.9)
(3,Math,7.1)
(4,Math,2.3)
…
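The two behaviors of FLATTEN can be put side by side in plain Python. The functions `flatten_bag` and `flatten_tuple` are hypothetical names for this sketch; in Pig both cases are written with the same FLATTEN operator.

```python
def flatten_bag(prefix, bag):
    """FLATTEN on a bag: one output tuple per element of the bag."""
    return [prefix + element for element in bag]


def flatten_tuple(prefix, inner):
    """FLATTEN on a tuple: its fields are spliced into the outer tuple."""
    return prefix + inner


# Bag case: (Math, {(6.7), (5.6), (10.0)}) becomes three tuples.
rows = flatten_bag(("Math",), [(6.7,), (5.6,), (10.0,)])

# Tuple case: (1, (Math, 5.6)) becomes a single, wider tuple.
row = flatten_tuple((1,), ("Math", 5.6))

# An empty bag yields no output tuples at all.
nothing = flatten_bag(("Math",), [])
print(rows, row, nothing)
```

The bag case changes the number of rows, while the tuple case only changes the number of fields.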
Functions & user defined functions
➢ Bag/Tuple functions:
○ SUBTRACT: tuples in the first bag that are not in the second
SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder20 IF age<20, OtherStudents OTHERWISE;
StudentsCoGr = GROUP StudentsUnder25 BY gender, StudentsUnder20 BY gender;
DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
StudentsSub = FOREACH StudentsCoGr GENERATE group, SUBTRACT(StudentsUnder25, StudentsUnder20);
DUMP StudentsSub;
(F,{(2,Mary,Doe,F,20),(6,Sarah,Kerrigan,F,21),(9,Princess,Peach,F,21)})
(M,{(10,Peter,Parker,M,23)})
Functions & user defined functions
➢ Bag/Tuple functions:
○ DIFF: tuples that appear in only one of the two bags
DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
StudentsDiff = FOREACH StudentsCoGr GENERATE group, DIFF(StudentsUnder25, StudentsUnder20);
DUMP StudentsDiff;
(F,{(2,Mary,Doe,F,20),(6,Sarah,Kerrigan,F,21),(9,Princess,Peach,F,21)})
(M,{(10,Peter,Parker,M,23)})
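SUBTRACT and DIFF give the same result in the example above only because the second bag is contained in the first. A plain-Python sketch of the two definitions shows where they diverge; `subtract` and `diff` are hypothetical helpers written from the one-line descriptions on the slides, not Pig's implementation.

```python
def subtract(bag1, bag2):
    """Tuples of bag1 that do not appear in bag2 (one-sided)."""
    return [t for t in bag1 if t not in bag2]


def diff(bag1, bag2):
    """Tuples that appear in exactly one of the two bags (two-sided)."""
    return subtract(bag1, bag2) + subtract(bag2, bag1)


under25 = [(10, "Peter", "Parker", "M", 23), (1, "John", "Doe", "M", 18)]
under20 = [(1, "John", "Doe", "M", 18)]

# Here under20 is contained in under25, so both functions agree.
same = subtract(under25, under20) == diff(under25, under20)

# Swapping the arguments shows the difference between the two.
one_sided = subtract(under20, under25)   # nothing is left
two_sided = diff(under20, under25)       # Peter Parker still shows up
print(same, one_sided, two_sided)
```

In other words, the order of the arguments matters for SUBTRACT but not for which tuples DIFF returns.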
Functions & user defined functions
➢ Math functions:
○ Common math functions for numeric values:
■ ABS
■ EXP
■ FLOOR
■ LOG
■ RANDOM
■ ROUND
■ SQRT
■ ...
Functions & user defined functions
➢ String functions:
○ Transform chararrays:
■ ENDSWITH
■ LOWER
■ UPPER
■ SUBSTRING
■ TRIM
■ REPLACE
■ ...
Functions & user defined functions
➢ Datetime functions:
○ Get information on dates and timestamps:
■ AddDuration
■ CurrentTime
■ ToDate
■ ToString
■ ToUnixTime
■ DaysBetween
■ ...
User defined functions
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// UDF that returns the tuples of the input bag in random order
public class SHUFFLE extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null)
            throw new IOException("Invalid input: null");
        if (input.size() != 1)
            throw new IOException("Expected one argument");
        if (input.get(0) == null)
            return null;
        DataBag bag = (DataBag) input.get(0);
        // Copy the bag into a list so that it can be shuffled in memory
        List<Tuple> l = new ArrayList<Tuple>();
        for (Tuple t : bag)
            l.add(t);
        Collections.shuffle(l);
        return BagFactory.getInstance().newDefaultBag(l);
    }

    @Override
    public Schema outputSchema(Schema input) {
        // The output bag has the same schema as the input bag
        try {
            return new Schema(input.getField(0));
        } catch (Exception e) {
            return null;
        }
    }
}
More functions: DataFu Pig
➢ Library of useful UDFs released in 2010
➢ Created by the LinkedIn engineering team:
○ Stats: variance, quantiles, median, etc.
○ Bags: concat, append, prepend, etc.
○ Sampling
○ Page rank
○ Session estimation
➢ Last major release: 1.2.0 (Dec, 2013)
http://datafu.incubator.apache.org/
How to use UDF libraries
REGISTER lib/datafu-1.2.0.jar;
DEFINE BagConcat datafu.pig.bags.BagConcat();
DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
StudentBagConcat = FOREACH StudentsCoGr GENERATE group, BagConcat(StudentsUnder25,StudentsUnder20);
DUMP StudentBagConcat;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18),(1,John,Doe,M,18)})
Note: REGISTER indicates the library to include; DEFINE indicates the UDF to use and the name it gets in the script
Scripting
-- Libraries and UDFs
REGISTER lib/datafu-1.2.0.jar;
DEFINE BagConcat datafu.pig.bags.BagConcat();
-- Load data ($student_file is a parameter)
Students = LOAD '$student_file' USING PigStorage('\t', '-noschema') AS (
    student_id: Long, name: Chararray, surname: Chararray, gender: Chararray,
    age: Int);
-- Transform data
SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder20 IF age<20, OtherStudents OTHERWISE;
StudentsCoGr = GROUP StudentsUnder25 BY gender, StudentsUnder20 BY gender;
StudentBagConcat = FOREACH StudentsCoGr GENERATE group, BagConcat(StudentsUnder25,StudentsUnder20);
-- Store data ($output is a parameter)
STORE StudentBagConcat INTO '$output' USING PigStorage('\t', '-schema');
Calling a script
pig -x mapred -f myscript.pig -param student_file=students.csv -param output=myoutput_path
Parts of the command: execution mode (-x mapred), script file (-f myscript.pig), parameter definitions (-param name=value)
More on load/store
➢ Not limited to plain text
➢ Multiple supported formats: JSON, Avro, Accumulo, etc.
➢ Connectors to data sources: MongoDB, Cassandra, HBase, etc.
Exercise: Product association
➢ Detect pairs of products bought together (e.g., chairs and tables)
➢ Goal: recommend related products
➢ Association score:
Product association
➢ Purchases: purchases.tsv
product_id  user_id  price  date
1           23       14.5   2014-03-03
4           15       11.2   2014-08-09
88          3        48.3   2011-01-01
...
➢ Products: products.tsv
product_id  status
1           ok
5           ko
99          ok
...
Time to work!
Final notes: Pros & cons
Pros:
➢ Clear and simple syntax
➢ Interactive client
➢ Transparent M/R jobs
➢ Integration with Java and others
Cons:
➢ Not as flexible as Hadoop
➢ Oriented towards ETL, not AI
➢ No loops
Extra information
➢ http://pig.apache.org/
➢ Programming Pig. Alan Gates. O’Reilly
➢ StackOverflow

Apache Pig: Making data transformation easy

  • 1.
    Apache Pig Making datatransformation easy Víctor Sánchez Anguix Universitat Politècnica de València MSc. In Artificial Intelligence, Pattern Recognition, and Digital Image Course 2014/2015
  • 2.
    Apache Pig: Makingdata transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Map Reduce Problem Solving Complex problem
  • 3.
    Apache Pig: Makingdata transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Map Reduce Problem Solving Complex problem
  • 4.
    Apache Pig: Makingdata transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Map Reduce Problem Solving ➢ Need to solve complex problem ➢ More complex atomic operations than M/R ➢ Java is not a data oriented language → Low productivity ➢ Any solutions?
  • 5.
    Apache Pig: Makingdata transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Apache Pig to the rescue!
  • 6.
    Apache Pig: Makingdata transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Join in Apache Hadoop public class DeliveryFileMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text>{ private String cellNumber,deliveryCode,fileTag="DR~"; public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { String line = value.toString(); String splitarray[] = line.split(","); cellNumber = splitarray[0].trim(); deliveryCode = splitarray[1].trim(); output.collect(new Text(cellNumber), new Text (fileTag+deliveryCode)); } } ** Extracted from http://kickstarthadoop.blogspot.com. es/2011/09/joins-with-plain-map-reduce.html public class SmsReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> { private String customerName,deliveryReport; private static Map<String,String> DeliveryCodesMap= new HashMap<String,String>(); public void configure(JobConf job){ loadDeliveryStatusCodes(); } public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException{ while (values.hasNext()){ String currValue = values.next().toString(); String valueSplitted[] = currValue.split("~"); if(valueSplitted[0].equals("CD")) customerName=valueSplitted[1].trim(); else if(valueSplitted[0].equals("DR")) deliveryReport = DeliveryCodesMap.get (valueSplitted[1].trim()); } if(customerName!=null && deliveryReport!=null) output.collect(new Text(customerName), new Text (deliveryReport)); else if(customerName==null) output.collect(new Text("customerName"), new Text (deliveryReport)); else if(deliveryReport==null) output.collect(new Text(customerName), new Text ("deliveryReport")); }
  • 7.
    Join in Apache Pig
  • 8.
    Join in Apache Pig
        C = JOIN A BY keyA, B BY keyB;
  • 9.
    Apache Pig overview
    ➢ Framework layer over HDFS and Hadoop
    ➢ Developed at Yahoo in 2006
    ➢ Users: Yahoo, LinkedIn, Twitter, IBM, etc.
    ➢ Last major release: 0.14.0 (November 2014)
    http://pig.apache.org/
  • 10.
    Apache Hadoop vs. Apache Pig
    Apache Hadoop:
    ➢ M/R as atomic operations
    ➢ Java is not data-oriented
    ➢ M/R inner flexibility
    ➢ Efficiency
    Apache Pig:
    ➢ ETL operations: Join, Filter, Group, etc.
    ➢ Pig Latin: Data scripting language
    ➢ UDF with Java (and others)
    ➢ Overhead of transforming scripts to M/R
  • 11.
    Pig Programming Model: Data
    ➢ Pig operations operate on relations
    ➢ A relation is a bag
    ➢ A bag is a collection of tuples
    ➢ A tuple is an ordered set of fields
    ➢ A field is any type of data
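The hierarchy on the slide above can be pictured with ordinary Python values. This is only a mental model under the assumption that a Python tuple stands for a Pig tuple and a list stands for a bag; it is not Pig code, and the variable names are made up for illustration.

```python
# Plain-Python picture of Pig's data hierarchy (an illustration, not Pig code).
# field -> any value; tuple -> Python tuple; bag -> list of tuples.
john = (1, "John", "Doe", "M", 18)   # a tuple: an ordered set of fields
mary = (2, "Mary", "Doe", "F", 20)
students = [john, mary]              # a bag: a collection of tuples

# A field can itself be a tuple or a bag, which is how nesting arises.
by_surname = ("Doe", students)       # a tuple holding a chararray and a bag

print(john[1])             # second field of the tuple
print(len(by_surname[1]))  # number of tuples in the nested bag
```

A relation would then simply be the outermost list, the one Pig distributes across the cluster.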
  • 12.
    Sounds complicated… but it’s not!
    ➢ Basic data types:
      ○ Boolean: true, false
      ○ Int and Long: 1, 2, 3, 4, 5
      ○ Float and Double: 2.3, 1.4, 4.5
      ○ Chararray: 'Hello', 'I am a string'
      ○ DateTime: 2014-09-11T12:20:14.1234+00:00
      ○ … and more, but you probably won’t use them very often
  • 13.
    Sounds complicated… but it’s not!
    ➢ Tuple: A catch-all data type
  • 14.
    Sounds complicated… but it’s not!
    ➢ Bag:
  • 15.
    Sounds complicated… but it’s not!
    ➢ Bag:
    ➢ And relations? Just the outermost (distributed) bags
  • 16.
    Loading data?
    ➢ Loading data? No, first let’s meet our friend Grunt
    ➢ Interactive Pig shell → Nice for debugging/experimenting
    ➢ pig -x local or pig -x mapred
  • 17.
    Loading data?
    ➢ Data source: Local or HDFS (usually!)
    ➢ LOAD instruction:
      ○ Data is automatically loaded in a distributed relation
        Students = LOAD 'student_path' USING PigStorage('\t', '-noschema')
            AS (student_id: Long, name: Chararray, surname: Chararray, gender: Chararray, age: Int);
    Callouts: relation name (Students), path on local disk/HDFS ('student_path'), connector (PigStorage), field separator ('\t'), tuple schema (AS clause)
  • 18.
    Loading data?
    ➢ Data source: Local or HDFS (usually!)
    ➢ LOAD instruction:
      ○ Data is automatically loaded in a distributed relation
        Grades = LOAD 'grade_path' USING PigStorage(',', '-schema');
    Callouts: relation name (Grades), path on local disk/HDFS ('grade_path'), connector (PigStorage), field separator (','), load schema from .pig_schema ('-schema')
  • 19.
    Checking relations’ content
    ➢ DUMP instruction:
      ○ Prints the content of a relation to standard output
        grunt> DUMP Students;
        (1,John,Doe,M,18)
        (2,Mary,Doe,F,20)
        (3,Lara,Croft,F,25)
        (4,Sherlock,Holmes,M,36)
        (5,John,Watson,M,38)
        (6,Sarah,Kerrigan,F,21)
        (7,Bruce,Wayne,M,32)
        (8,Tony,Stark,M,33)
        (9,Princess,Peach,F,21)
        (10,Peter,Parker,M,23)
    Callouts: relation name (Students)
  • 20.
    Checking relations’ content
    ➢ DESCRIBE instruction:
      ○ Prints the schema of the relation to standard output
        grunt> DESCRIBE Students;
        Students: {student_id: long,name: chararray,surname: chararray,gender: chararray,age: int}
    Callouts: relation name (Students)
  • 21.
    Checking relations’ content
    ➢ ILLUSTRATE instruction:
      ○ Prints the schema of the relation and an example tuple to standard output
        grunt> ILLUSTRATE Students;
        ------------------------------------------------------------------------------------------------
        | Students | student_id:long | name:chararray | surname:chararray | gender:chararray | age:int |
        ------------------------------------------------------------------------------------------------
        |          | 9               | Princess       | Peach             | F                | 21      |
        ------------------------------------------------------------------------------------------------
    Callouts: relation name (Students)
  • 22.
    Operating on relations
    ➢ FOREACH instruction:
      ○ Generate new relations by projecting data of a relation
        StudentsProj = FOREACH Students GENERATE student_id, name, age;
    Callouts: relation name (StudentsProj), base relation (Students), projected data (student_id, name, age)
  • 23.
    Operating on relations
    ➢ FOREACH instruction:
      ○ Generate new relations by projecting data of a relation
        StudentsProj = FOREACH Students GENERATE student_id, CONCAT(name, surname) AS full_name, age;
    Callouts: relation name (StudentsProj), base relation (Students), projected data. We can generate new data too!!
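As a rough picture of what the FOREACH ... GENERATE statement above does, the following plain-Python sketch maps a function over every tuple of a relation. The helper name `foreach_generate` and the sample rows are assumptions made for illustration; this models the semantics only and is not how Pig is implemented.

```python
students = [
    (1, "John", "Doe", "M", 18),
    (2, "Mary", "Doe", "F", 20),
]

def foreach_generate(relation, fn):
    """Apply fn to every tuple and collect the generated tuples."""
    return [fn(t) for t in relation]

# student_id, CONCAT(name, surname) AS full_name, age
students_proj = foreach_generate(
    students, lambda t: (t[0], t[1] + t[2], t[4])
)
print(students_proj)  # [(1, 'JohnDoe', 18), (2, 'MaryDoe', 20)]
```

Note that plain concatenation adds no separator, so the full name comes out as `JohnDoe`.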
  • 24.
    Operating on relations
    ➢ FOREACH instruction:
      ○ Let us execute the instruction and… it seems that nothing happens!
      ○ We had some tracing output with LOAD, DUMP, and ILLUSTRATE…
      ○ Any ideas on this issue?
  • 25.
    Operating on relations
    ➢ Pig employs lazy evaluation
    ➢ Computation happens only with:
      ○ LOAD, ILLUSTRATE, DUMP, STORE
    ➢ Pig keeps a DAG of the MR jobs needed to compute relations (optimized!)
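Lazy evaluation can be sketched in a few lines of plain Python: each operator records a recipe, and nothing runs until an output is requested. The `Relation` class below is a hypothetical toy, not Pig's API, and it simplifies matters by deferring everything, including the load, until `dump()` is called.

```python
class Relation:
    """Hypothetical stand-in for a Pig alias: it stores a recipe, not data."""

    def __init__(self, compute):
        self._compute = compute

    def foreach(self, fn):
        return Relation(lambda: [fn(t) for t in self._compute()])

    def filter(self, pred):
        return Relation(lambda: [t for t in self._compute() if pred(t)])

    def dump(self):
        # Only here is the chain of recipes actually executed.
        return self._compute()

executed = []  # records when the data source is really read

def load():
    executed.append("load")
    return [(1, "John", 18), (3, "Lara", 25), (7, "Bruce", 32)]

students = Relation(load)
filtered = students.filter(lambda t: t[2] > 24)
names = filtered.foreach(lambda t: (t[1],))

print(executed)      # [] : defining relations ran nothing
print(names.dump())  # [('Lara',), ('Bruce',)]
print(executed)      # ['load']
```

Deferring the work is what gives an engine the chance to look at the whole chain and optimize it before running anything.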
  • 26.
    Exercise: Who is under 25?
    ➢ Extend the Students relation with a field that tells whether each student is under 25 years old
        (1,John,Doe,M,18,true)
        (2,Mary,Doe,F,20,true)
        (3,Lara,Croft,F,25,false)
        (4,Sherlock,Holmes,M,36,false)
        (5,John,Watson,M,38,false)
        (6,Sarah,Kerrigan,F,21,true)
        ...
  • 27.
    Operating on relations
    ➢ FILTER instruction:
      ○ Generate a new relation by filtering data on a relation
        StudentsFilt = FILTER Students BY age > 24 AND age < 34;
        DUMP StudentsFilt;
        (3,Lara,Croft,F,25)
        (7,Bruce,Wayne,M,32)
        (8,Tony,Stark,M,33)
    Callouts: relation name (StudentsFilt), base relation (Students), condition to fulfill (age > 24 AND age < 34)
  • 28.
    Operating on relations
    ➢ SPLIT instruction:
      ○ Splits a relation into multiple relations based on conditions
        SPLIT Students INTO StudentsMale IF gender == 'M', StudentsFemale OTHERWISE;
        DUMP StudentsMale;
        (1,John,Doe,M,18)
        (4,Sherlock,Holmes,M,36)
        (5,John,Watson,M,38)
        (7,Bruce,Wayne,M,32)
        (8,Tony,Stark,M,33)
        (10,Peter,Parker,M,23)
    Callouts: base relation (Students), new relations (StudentsMale, StudentsFemale), condition to fulfill by each new relation. OTHERWISE means the rest
  • 29.
    Operating on relations
    ➢ SPLIT instruction:
      ○ Splits a relation into multiple relations based on conditions
        SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder30 IF age<30, OtherStudents OTHERWISE;
        DUMP OtherStudents;
        (4,Sherlock,Holmes,M,36)
        (5,John,Watson,M,38)
        (8,Tony,Stark,M,33)
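One detail worth seeing in code is that SPLIT conditions need not be mutually exclusive: a tuple that satisfies several conditions lands in several output relations, and OTHERWISE only collects tuples that matched none. The plain-Python sketch below models that behavior; the `split` helper, the dictionary keys, and the three sample rows are illustrative assumptions.

```python
students = [
    (1, "John", "Doe", "M", 18),
    (3, "Lara", "Croft", "F", 25),
    (4, "Sherlock", "Holmes", "M", 36),
]

def split(relation, conditions):
    """conditions: list of (name, predicate); unmatched tuples go to 'otherwise'."""
    out = {name: [] for name, _ in conditions}
    out["otherwise"] = []
    for t in relation:
        matched = False
        for name, pred in conditions:
            if pred(t):
                out[name].append(t)  # no break: later conditions are still tried
                matched = True
        if not matched:
            out["otherwise"].append(t)
    return out

parts = split(students, [
    ("under25", lambda t: t[4] < 25),
    ("under30", lambda t: t[4] < 30),
])
print([t[0] for t in parts["under25"]])    # [1]
print([t[0] for t in parts["under30"]])    # [1, 3]
print([t[0] for t in parts["otherwise"]])  # [4]
```

Student 1 appears in both `under25` and `under30`, which is why the later SUBTRACT and DIFF slides can compare overlapping relations.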
  • 30.
    Operating on relations
    ➢ GROUP instruction:
      ○ Creates tuples with the key and a bag of the tuples that share the same key values
        StudentsGr = GROUP Students BY gender;
        DUMP StudentsGr;
        (F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(3,Lara,Croft,F,25),(2,Mary,Doe,F,20)})
        (M,{(10,Peter,Parker,M,23),(8,Tony,Stark,M,33),(7,Bruce,Wayne,M,32),(5,John,Watson,M,38),(4,Sherlock,Holmes,M,36),(1,John,Doe,M,18)})
        DESCRIBE StudentsGr;
        StudentsGr: {group: chararray,Students: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)}}
    Callouts: new relation (StudentsGr), base relation (Students), use these fields’ values to make groups (gender), new schema!
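The shape of a grouped relation, one tuple per key holding the key and a bag of the original tuples, can be mimicked in plain Python. This is a sketch of the semantics only; `group_by` and the sample rows are names and data chosen for illustration.

```python
from collections import defaultdict

students = [
    (1, "John", "Doe", "M", 18),
    (2, "Mary", "Doe", "F", 20),
    (3, "Lara", "Croft", "F", 25),
]

def group_by(relation, key):
    groups = defaultdict(list)
    for t in relation:
        groups[key(t)].append(t)
    # One output tuple per key: (group, bag of the original, unmodified tuples)
    return [(k, bag) for k, bag in groups.items()]

students_gr = group_by(students, key=lambda t: t[3])
for group, bag in students_gr:
    print(group, [t[0] for t in bag])
```

The grouped tuples keep every original field, including the grouping field itself, which matches the DESCRIBE output on the slide.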
  • 31.
    Operating on relations
    ➢ GROUP instruction:
      ○ We can use multiple relations. Creates one bag per relation
        StudentsCoGr = GROUP StudentsUnder25 BY gender, OtherStudents BY gender;
        DUMP StudentsCoGr;
        (F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
        (M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(8,Tony,Stark,M,33),(5,John,Watson,M,38),(4,Sherlock,Holmes,M,36)})
        DESCRIBE StudentsCoGr;
        StudentsCoGr: {group: chararray,StudentsUnder25: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)},OtherStudents: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)}}
    Callouts: new relation (StudentsCoGr), base relations (StudentsUnder25, OtherStudents), use these fields’ values to make groups (gender), new schema!
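Grouping several relations at once yields one bag per input relation for every key, with an empty bag where a relation has no tuples for that key. A minimal plain-Python sketch, assuming two inputs and a shared key function (the `cogroup` helper and sample rows are hypothetical):

```python
def cogroup(left, right, key):
    keys = sorted({key(t) for t in left} | {key(t) for t in right})
    # For every key: the key, the matching tuples of left, the matching tuples of right
    return [
        (k, [t for t in left if key(t) == k], [t for t in right if key(t) == k])
        for k in keys
    ]

under25 = [(1, "John", "Doe", "M", 18), (2, "Mary", "Doe", "F", 20)]
others = [(4, "Sherlock", "Holmes", "M", 36)]

for row in cogroup(under25, others, key=lambda t: t[3]):
    print(row)
```

The `F` row ends with an empty bag, mirroring the `{}` in the DUMP output on the slide. Keys are sorted here only to make the printout stable.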
  • 32.
    Operating on relations
    ➢ Nested FOREACH:
      ○ Operate on data in bags inside a relation and then project
        StudentsNested = FOREACH StudentsGr {
            Information = FOREACH Students GENERATE name, surname;
            GENERATE group AS gender, Information AS student_information;
        }
        DUMP StudentsNested;
        (F,{(Princess,Peach),(Sarah,Kerrigan),(Lara,Croft),(Mary,Doe)})
        (M,{(Peter,Parker),(Tony,Stark),(Bruce,Wayne),(John,Watson),(Sherlock,Holmes),(John,Doe)})
    Callouts: new relation (StudentsNested), base relation (StudentsGr), bag inside base relation (Students), new bag (Information), finally project (GENERATE)
  • 33.
    Operating on relations
    ➢ (inner) JOIN instruction:
      ○ Our classic database operator for relations!
        StudentsGrades = JOIN Students BY student_id, Grades BY student_id;
        DUMP StudentsGrades;
        (1,John,Doe,M,18,1,Physics,2.3)
        (1,John,Doe,M,18,1,Biology,4.5)
        (1,John,Doe,M,18,1,Engineering,7.7)
        (1,John,Doe,M,18,1,Math,5.6)
        (2,Mary,Doe,F,20,2,Engineering,6.7)
        (2,Mary,Doe,F,20,2,Physics,6.7)
        …
        DESCRIBE StudentsGrades;
        StudentsGrades: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}
    Callouts: new relation (StudentsGrades), base relations (Students, Grades), use these fields’ values as join keys (student_id), new schema!
  • 34.
    Operating on relations
    ➢ (left) JOIN instruction:
      ○ Our classic database operator for relations!
        StudentsGrades = JOIN Students BY student_id LEFT, Grades BY student_id;
        DUMP StudentsGrades;
        (6,Sarah,Kerrigan,F,21,,,)
        (7,Bruce,Wayne,M,32,7,Engineering,8.5)
        (7,Bruce,Wayne,M,32,7,Physics,8.9)
        (7,Bruce,Wayne,M,32,7,Math,8.5)
        (8,Tony,Stark,M,33,8,Math,6.7)
        …
        DESCRIBE StudentsGrades;
        StudentsGrades: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}
    Callouts: new relation (StudentsGrades), left relation (Students), right relation (Grades), do not forget this one! (LEFT), new schema!
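The difference between the inner and the left join shown on the two slides above can be made concrete with a small plain-Python model. It uses a nested loop for clarity rather than speed; the `join` helper, its parameters, and the tiny relations are assumptions for illustration, and null-key handling is not modeled.

```python
def join(left, right, left_key, right_key, right_width, how="inner"):
    rows = []
    for l in left:
        matches = [r for r in right if right_key(r) == left_key(l)]
        if matches:
            # One output tuple per matching pair: left fields followed by right fields
            rows.extend(l + r for r in matches)
        elif how == "left":
            # Left join keeps unmatched left tuples, padded with nulls
            rows.append(l + (None,) * right_width)
    return rows

students = [(1, "John"), (6, "Sarah")]
grades = [(1, "Math", 5.6), (1, "Physics", 2.3)]
key = lambda t: t[0]

print(join(students, grades, key, key, right_width=3))
print(join(students, grades, key, key, right_width=3, how="left"))
```

Sarah has no grades, so she disappears from the inner join and shows up with three null fields in the left join, like `(6,Sarah,Kerrigan,F,21,,,)` on the slide.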
  • 35.
    Operating on relations
    ➢ CROSS instruction:
      ○ Cartesian product of two or more relations
        StudentsCr = CROSS Students, Grades;
        DUMP StudentsCr;
        (10,Peter,Parker,M,23,10,Physics,3.3)
        (10,Peter,Parker,M,23,9,Physics,5.0)
        (10,Peter,Parker,M,23,7,Physics,8.9)
        (10,Peter,Parker,M,23,5,Physics,4.5)
        (10,Peter,Parker,M,23,4,Physics,6.6)
        (10,Peter,Parker,M,23,3,Physics,5.7)
        (10,Peter,Parker,M,23,2,Physics,6.7)
        (10,Peter,Parker,M,23,1,Physics,2.3)
        …
        DESCRIBE StudentsCr;
        StudentsCr: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}
    Callouts: new relation (StudentsCr), relation 1 (Students), relation 2 (Grades), new schema!
  • 36.
    Operating on relations
    ➢ UNION instruction:
      ○ Merges multiple relations into a single relation
        StudentsUnion = UNION Students, Grades;
        DUMP StudentsUnion;
        (1,John,Doe,M,18)
        (1,Math,5.6)
        (2,Mary,Doe,F,20)
        (2,Math,8.9)
        (3,Lara,Croft,F,25)
        (3,Math,7.1)
        …
        DESCRIBE StudentsUnion;
        Schema for StudentsUnion unknown.
    Callouts: new relation (StudentsUnion), relation 1 (Students), relation 2 (Grades). Union does not preserve schemas when the inputs' schemas differ, as they do here!
  • 37.
    Operating on relations
    ➢ DISTINCT instruction:
      ○ Only preserves unique tuples
        Courses = FOREACH Grades GENERATE course AS course;
        UniqueCourses = DISTINCT Courses;
        DUMP UniqueCourses;
        (Math)
        (Biology)
        (Physics)
        (Engineering)
    Callouts: new relation (UniqueCourses)
  • 38.
    Operating on relations
    ➢ ORDER BY instruction:
      ○ Sorts relations by a specific criterion
        SortedGrades = ORDER Grades BY mark DESC;
        DUMP SortedGrades;
        (2,Biology,10.0)
        (10,Engineering,10.0)
        (10,Math,10.0)
        (5,Biology,10.0)
        (5,Engineering,9.0)
        (7,Physics,8.9)
        …
    Callouts: new relation (SortedGrades), base relation (Grades), field(s) used to sort (mark), sort order: DESC (descending) or ASC (ascending)
  • 39.
    Operating on relations
    ➢ LIMIT instruction:
      ○ Truncates a relation’s size
        BestGrades = LIMIT SortedGrades 3;
        DUMP BestGrades;
        (10,Math,10.0)
        (10,Engineering,10.0)
        (2,Biology,10.0)
    Callouts: new relation (BestGrades), base relation (SortedGrades), maximum number of tuples (3)
  • 40.
    Operating on relations
    ➢ RANK instruction:
      ○ Prepends the position of each tuple in the relation
        RankedGrades = RANK SortedGrades;
        DUMP RankedGrades;
        (1,2,Biology,10.0)
        (2,10,Engineering,10.0)
        (3,10,Math,10.0)
        (4,5,Biology,10.0)
        (5,5,Engineering,9.0)
        …
        DESCRIBE RankedGrades;
        RankedGrades: {rank_SortedGrades: long,student_id: long,course: chararray,mark: double}
    Callouts: new relation (RankedGrades), base relation (SortedGrades), rank number! (first field)
  • 41.
    Operating on relations
    ➢ RANK instruction:
      ○ We can also sort and rank!
        RankedGrades = RANK SortedGrades BY student_id ASC, mark DESC;
        DUMP RankedGrades;
        (1,1,Engineering,7.7)
        (2,1,Math,5.6)
        (3,1,Biology,4.5)
        (4,1,Physics,2.3)
        (5,2,Biology,10.0)
        …
        DESCRIBE RankedGrades;
        RankedGrades: {rank_SortedGrades: long,student_id: long,course: chararray,mark: double}
    Callouts: new relation (RankedGrades), base relation (SortedGrades), fields to sort (student_id ASC, mark DESC)
  • 42.
    Operating on relations
    ➢ SAMPLE instruction:
      ○ Sample the relation!
        SampledGrades = SAMPLE Grades 0.05;
        DUMP SampledGrades;
        (4,Engineering,8.0)
    Callouts: new relation (SampledGrades), base relation (Grades), proportion to sample (0.05)
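Sampling by proportion is probabilistic: each tuple is kept or dropped independently, so the size of the result varies from run to run and is only close to the requested fraction on average. The plain-Python sketch below models that idea; the `sample` helper, the `seed` parameter, and the made-up grades are assumptions for illustration.

```python
import random

def sample(relation, p, seed=None):
    rng = random.Random(seed)
    # Each tuple is kept independently with probability p
    return [t for t in relation if rng.random() < p]

grades = [(i, "Math", 5.0) for i in range(1, 32)]  # 31 made-up tuples
picked = sample(grades, 0.05, seed=0)
print(len(picked), "of", len(grades), "tuples kept")
```

With 31 tuples and a proportion of 0.05, getting a single tuple back, as on the slide, is a typical outcome, but zero or several tuples are possible too.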
  • 43.
    Exercise: Top grades
    ➢ Get the 3 top grades for each student
        (1,{(Engineering,7.7),(Math,5.6),(Biology,4.5)})
        (2,{(Biology,10.0),(Math,8.9),(Engineering,6.7)})
        (3,{(Math,7.1),(Physics,5.7),(Engineering,4.3)})
        (4,{(Engineering,8.0),(Biology,6.7),(Physics,6.6)})
        (5,{(Biology,10.0),(Engineering,9.0),(Math,6.7)})
        (6,{(,)})
        ...
  • 44.
    Operating on relations
    ➢ CUBE instruction:
      ○ Is this really useful? Yes! Many aggregates with just one operation
        CubedGrades = CUBE Grades BY CUBE(student_id, course);
        CubedGrades = FOREACH CubedGrades GENERATE group, AVG(cube.mark);
        DUMP CubedGrades;
        ((,Math),7.188888888888889)
        ((,Biology),7.8)
        ((,Physics),5.375)
        ((,Engineering),6.877777777777778)
        ((,),6.729032258064516)
        ((2,Math),8.9)
        ((2,Biology),10.0)
        ((2,),8.075)
        …
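To see why one CUBE produces so many aggregates, it helps to spell out which groups each tuple contributes to: every combination of "keep this dimension" or "replace it with null". The plain-Python sketch below models that with `None` standing for null; `cube_avg`, its parameters, and the rounded marks are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

grades = [(2, "Math", 8.0), (2, "Biology", 10.0), (3, "Math", 7.0)]

def cube_avg(relation, dims, value):
    """dims: indexes of the cubed fields; value: index of the averaged field."""
    buckets = defaultdict(list)
    for t in relation:
        # Every dimension is either kept or replaced by None (null)
        for mask in product([True, False], repeat=len(dims)):
            group = tuple(t[d] if keep else None for d, keep in zip(dims, mask))
            buckets[group].append(t[value])
    return {g: sum(v) / len(v) for g, v in buckets.items()}

result = cube_avg(grades, dims=(0, 1), value=2)
print(result[(None, "Math")])  # 7.5, average Math mark over all students
print(result[(2, None)])       # 9.0, average mark of student 2 over all courses
print(result[(None, None)])    # overall average
```

With two dimensions each tuple feeds four groups, so per-course, per-student, per-pair, and overall averages all come out of a single pass.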
  • 45.
    Operating on relations
    ➢ CUBE/ROLLUP instruction:
      ○ Like standard CUBE, but null values are introduced from right to left
        RolledGrades = CUBE Grades BY ROLLUP(course, student_id);
        RolledGrades = FOREACH RolledGrades GENERATE group, AVG(cube.mark);
        DUMP RolledGrades;
        ((Math,),7.188888888888889)
        ((Math,2),8.9)
        ((Math,3),7.1)
        ((Math,4),2.3)
        ((Math,5),6.7)
        ((Math,7),8.5)
        ((Math,8),6.7)
        ((Math,9),8.9)
        …
    Callouts: order matters! (course, student_id)
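"Nulls introduced from right to left" means that ROLLUP only generates the hierarchical prefixes of the dimension list, not every combination. A plain-Python sketch of that rule follows, with `None` standing for null; `rollup_avg` and the rounded marks are illustrative assumptions.

```python
from collections import defaultdict

grades = [(2, "Math", 8.0), (3, "Math", 7.0), (2, "Biology", 10.0)]

def rollup_avg(relation, dims, value):
    buckets = defaultdict(list)
    for t in relation:
        # Keep the first k dimensions and null out the rest: k = n, n-1, ..., 0
        for k in range(len(dims), -1, -1):
            group = tuple(t[d] if i < k else None for i, d in enumerate(dims))
            buckets[group].append(t[value])
    return {g: sum(v) / len(v) for g, v in buckets.items()}

# ROLLUP(course, student_id): course is field 1, student_id is field 0
result = rollup_avg(grades, dims=(1, 0), value=2)
print(result[("Math", None)])  # 7.5, average per course
print((None, 2) in result)     # False, no per-student totals
print(len(result))             # 6 groups instead of the 8 a full cube would give
```

Swapping the dimension order to (student_id, course) would give per-student totals and drop the per-course ones, which is why the order matters.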
  • 46.
    Operating on relations
    ➢ ASSERT instruction:
      ○ Assert that the whole relation fulfills a condition
      ○ Useful for debugging
        ASSERT Grades BY mark > 0.0, 'marks should be greater than 0';
    Callouts: base relation (Grades), condition (mark > 0.0), error message
  • 47.
    Finally, storing data!
    ➢ STORE instruction:
      ○ Stores the relation into the local FS or HDFS (usually!)
      ○ Useful for debugging
        STORE BestGrades INTO 'best_grades_path' USING PigStorage('\t', '-noschema');
    Callouts: relation (BestGrades), path to store data ('best_grades_path'), connector (PigStorage), field separator ('\t')
  • 48.
    Problems solved?!
  • 49.
    Only these operations?
    ➢ ASSERT
    ➢ COGROUP
    ➢ CROSS
    ➢ CUBE
    ➢ DISTINCT
    ➢ FILTER
    ➢ FOREACH
    ➢ GROUP
    ➢ JOIN
    ➢ LIMIT
    ➢ LOAD
    ➢ ORDER, RANK
    ➢ SAMPLE
    ➢ SPLIT
    ➢ UNION
    ➢ DUMP, ILLUSTRATE, DESCRIBE
  • 50.
    Functions & user defined functions
    ➢ Transform data in data projections
    ➢ Built-in functions:
      ○ math functions, string functions, datetime functions, casting functions, etc.
    ➢ User defined functions:
      ○ Our own functions written in Java, Python, Ruby, JavaScript, etc.
  • 51.
    Functions & user defined functions
    ➢ Bag functions:
      ○ AVG/MAX/MIN/SUM: compute the average/max/min/sum of a bag of numeric values
        GradesGr = GROUP Grades BY course;
        GradesAvg = FOREACH GradesGr GENERATE group AS course, AVG(Grades.mark) AS avg_mark;
        DUMP GradesAvg;
        (Math,7.188888888888889)
        (Biology,7.8)
        (Physics,5.375000000000001)
        (Engineering,6.877777777777777)
    Callouts: employ only this field in bag/tuple (Grades.mark)
  • 52.
    Functions & user defined functions
    ➢ Bag functions:
      ○ COUNT: number of elements (not null) in a bag
        GradesCount = FOREACH GradesGr GENERATE group AS course, COUNT(Grades) AS number_students;
        DUMP GradesCount;
        (Math,9)
        (Biology,5)
        (Physics,8)
        (Engineering,9)
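The "(not null)" remark matters once the data has gaps. As far as the Pig documentation describes it, COUNT skips tuples whose first field is null, and AVG ignores null values. The plain-Python sketch below models those two rules with `None` as null; the function names and the sample bag are assumptions for illustration.

```python
def count(bag):
    # A tuple is not counted when its first field is null
    return sum(1 for t in bag if t[0] is not None)

def avg(values):
    # Null values are ignored; an input with no values gives null
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

math_bag = [(8, "Math", 6.0), (1, "Math", 5.0), (None, "Math", None)]
print(count(math_bag))                # 2
print(avg([t[2] for t in math_bag]))  # 5.5
```

If nulls should be counted as well, Pig offers COUNT_STAR for that purpose.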
  • 53.
    Functions & user defined functions
    ➢ Bag/Tuple functions:
      ○ FLATTEN: behavior depends on input
        DUMP GradesGr;
        (Math,{(8,Math,6.7),(1,Math,5.6),(10,Math,10.0),(9,Math,8.9),(2,Math,8.9),(3,Math,7.1),(4,Math,2.3),(5,Math,6.7),(7,Math,8.5)})
        (Biology,{(5,Biology,10.0),(4,Biology,6.7),(2,Biology,10.0),(1,Biology,4.5),(9,Biology,7.8)})
        ...
        GradesFlat = FOREACH GradesGr GENERATE group AS course, FLATTEN(Grades.mark) AS mark;
        DUMP GradesFlat;
        (Math,6.7)
        (Math,5.6)
        (Math,10.0)
        …
  • 54.
    Functions & user defined functions
    ➢ Bag/Tuple functions:
      ○ FLATTEN: behavior depends on input
        GradesTuple = FOREACH Grades GENERATE student_id, TOTUPLE(course, mark) AS tuple_mark;
        DUMP GradesTuple;
        (1,(Math,5.6))
        (2,(Math,8.9))
        (3,(Math,7.1))
        (4,(Math,2.3))
        ...
        GradesUntupled = FOREACH GradesTuple GENERATE student_id AS student_id, FLATTEN(tuple_mark);
        DUMP GradesUntupled;
        (1,Math,5.6)
        (2,Math,8.9)
        (3,Math,7.1)
        (4,Math,2.3)
        …
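The two FLATTEN behaviors shown on the slides above differ in how many rows come out: flattening a bag multiplies rows, flattening a tuple only removes one level of nesting. A plain-Python sketch of both cases, with hypothetical helper names:

```python
def flatten_bag(row, bag_index):
    """One output row per tuple in the bag; an empty bag yields no rows."""
    head, bag, tail = row[:bag_index], row[bag_index], row[bag_index + 1:]
    return [head + t + tail for t in bag]

def flatten_tuple(row, tuple_index):
    """Inline the nested tuple's fields; always exactly one output row."""
    return row[:tuple_index] + row[tuple_index] + row[tuple_index + 1:]

print(flatten_bag(("Math", [(6.7,), (5.6,)]), 1))  # [('Math', 6.7), ('Math', 5.6)]
print(flatten_tuple((1, ("Math", 5.6)), 1))        # (1, 'Math', 5.6)
```

The bag case is effectively the inverse of GROUP, which is why it is often used right after a grouping step.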
  • 55.
    Functions & user defined functions
    ➢ Bag/Tuple functions:
      ○ SUBTRACT: Tuples in the first bag that are not in the second
        SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder20 IF age<20, OtherStudents OTHERWISE;
        StudentsCoGr = GROUP StudentsUnder25 BY gender, StudentsUnder20 BY gender;
        DUMP StudentsCoGr;
        (F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
        (M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
        StudentsSub = FOREACH StudentsCoGr GENERATE group, SUBTRACT(StudentsUnder25, StudentsUnder20);
        DUMP StudentsSub;
        (F,{(2,Mary,Doe,F,20),(6,Sarah,Kerrigan,F,21),(9,Princess,Peach,F,21)})
        (M,{(10,Peter,Parker,M,23)})
  • 56.
    Functions & user defined functions
    ➢ Bag/Tuple functions:
      ○ DIFF: Non-overlapping tuples of two bags
        DUMP StudentsCoGr;
        (F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
        (M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
        StudentsDiff = FOREACH StudentsCoGr GENERATE group, DIFF(StudentsUnder25, StudentsUnder20);
        DUMP StudentsDiff;
        (F,{(2,Mary,Doe,F,20),(6,Sarah,Kerrigan,F,21),(9,Princess,Peach,F,21)})
        (M,{(10,Peter,Parker,M,23)})
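SUBTRACT and DIFF give identical results on the slides only because every student under 20 is also under 25. In general SUBTRACT is one-sided while DIFF is symmetric, which the following plain-Python sketch illustrates with two made-up bags (the helper names mirror the Pig functions but are ordinary Python, not Pig's implementation):

```python
def subtract(bag1, bag2):
    # Tuples of the first bag that do not appear in the second
    return [t for t in bag1 if t not in bag2]

def diff(bag1, bag2):
    # Tuples that appear in exactly one of the two bags
    return subtract(bag1, bag2) + subtract(bag2, bag1)

a = [(1, "John"), (10, "Peter")]
b = [(1, "John"), (2, "Mary")]
print(subtract(a, b))  # [(10, 'Peter')]
print(diff(a, b))      # [(10, 'Peter'), (2, 'Mary')]
```

When the second bag is a subset of the first, as with StudentsUnder20 and StudentsUnder25, the second half of `diff` is empty and both functions agree.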
  • 57.
    Functions & user defined functions
    ➢ Math functions:
      ○ Common math functions for numeric values:
        ■ ABS
        ■ EXP
        ■ FLOOR
        ■ LOG
        ■ RANDOM
        ■ ROUND
        ■ SQRT
        ■ ...
  • 58.
    Functions & user defined functions
    ➢ String functions:
      ○ Transform chararrays:
        ■ ENDSWITH
        ■ LOWER
        ■ UPPER
        ■ SUBSTRING
        ■ TRIM
        ■ REPLACE
        ■ ...
  • 59.
    Functions & user defined functions
    ➢ Datetime functions:
      ○ Get information on dates and timestamps:
        ■ AddDuration
        ■ CurrentTime
        ■ ToDate
        ■ ToString
        ■ ToUnixTime
        ■ DaysBetween
        ■ ...
  • 60.
    User defined functions

        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.List;

        import org.apache.pig.EvalFunc;
        import org.apache.pig.data.BagFactory;
        import org.apache.pig.data.DataBag;
        import org.apache.pig.data.Tuple;
        import org.apache.pig.impl.logicalLayer.schema.Schema;

        public class SHUFFLE extends EvalFunc<DataBag> {
            @Override
            public DataBag exec(Tuple input) throws IOException {
                if (input == null)
                    throw new IOException("Invalid input: null");
                if (input.size() != 1)
                    throw new IOException("Expected one argument");
                if (input.get(0) == null)
                    return null;
                DataBag bag = (DataBag) input.get(0);
                // Copy the bag into a list so it can be shuffled in memory
                List<Tuple> l = new ArrayList<Tuple>();
                for (Tuple t : bag)
                    l.add(t);
                Collections.shuffle(l);
                return BagFactory.getInstance().newDefaultBag(l);
            }

            @Override
            public Schema outputSchema(Schema input) {
                // The output bag has the same schema as the input bag
                try {
                    return new Schema(input.getField(0));
                } catch (Exception e) {
                    return null;
                }
            }
        }
  • 61.
    More functions: DataFu Pig
    ➢ Library of useful UDFs released in 2010
    ➢ Created by the LinkedIn engineering team:
      ○ Stats: variance, quantiles, median, etc.
      ○ Bags: concat, append, prepend, etc.
      ○ Sampling
      ○ Page rank
      ○ Session estimation
    ➢ Last major release: 1.2.0 (Dec, 2013)
    http://datafu.incubator.apache.org/
  • 62.
    How to use UDF libraries
        REGISTER lib/datafu-1.2.0.jar;
        DEFINE BagConcat datafu.pig.bags.BagConcat();
        DUMP StudentsCoGr;
        (F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
        (M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
        StudentBagConcat = FOREACH StudentsCoGr GENERATE group, BagConcat(StudentsUnder25, StudentsUnder20);
        DUMP StudentBagConcat;
        (F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)})
        (M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18),(1,John,Doe,M,18)})
    Callouts: indicate the UDF to be included and its name (DEFINE BagConcat ...)
  • 63.
    Scripting
        -- Libraries and UDFs
        REGISTER lib/datafu-1.2.0.jar;
        DEFINE BagConcat datafu.pig.bags.BagConcat();

        -- Load data ($student_file is a parameter)
        Students = LOAD '$student_file' USING PigStorage('\t', '-noschema')
            AS (student_id: Long, name: Chararray, surname: Chararray, gender: Chararray, age: Int);

        -- Transform data
        SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder20 IF age<20, OtherStudents OTHERWISE;
        StudentsCoGr = GROUP StudentsUnder25 BY gender, StudentsUnder20 BY gender;
        StudentBagConcat = FOREACH StudentsCoGr GENERATE group, BagConcat(StudentsUnder25, StudentsUnder20);

        -- Store data ($output is a parameter)
        STORE StudentBagConcat INTO '$output' USING PigStorage('\t', '-schema');
  • 64.
    Calling a script
        pig -x mapred -f myscript.pig -param student_file=students.csv -param output=myoutput_path
    Callouts: execution mode (-x mapred), script file (-f myscript.pig), parameter definitions (-param name=value)
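Parameters such as `$student_file` are filled in textually before the script runs. The plain-Python sketch below imitates that substitution step with the standard library's `string.Template`; it is only an analogy for how the `-param` values end up in the script text, and the two-line script is a shortened, hypothetical version of the one shown earlier.

```python
from string import Template

script = (
    "Students = LOAD '$student_file' USING PigStorage('\\t');\n"
    "STORE Students INTO '$output';"
)
# Values as they would be passed with -param name=value
params = {"student_file": "students.csv", "output": "myoutput_path"}

resolved = Template(script).substitute(params)
print(resolved)
```

Because the substitution is textual, the quotes around the path stay in the script and only the placeholder itself is replaced.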
  • 65.
    More on load/store
    ➢ Not limited to plain text
    ➢ Multiple supported formats: JSON, Avro, Accumulo, etc.
    ➢ Connectors to data sources: MongoDB, Cassandra, HBase, etc.
  • 66.
    Exercise: Product association
    ➢ Detect pairs of products bought together (e.g., chairs and tables)
    ➢ Goal: recommend related products
    ➢ Association score:
  • 67.
    Product association
    ➢ Purchases: purchases.tsv
        product_id  user_id  price  date
        1           23       14.5   2014-03-03
        4           15       11.2   2014-08-09
        88          3        48.3   2011-01-01
        ...
    ➢ Products: products.tsv
        product_id  status
        1           ok
        5           ko
        99          ok
        ...
  • 68.
    Time to work!
  • 69.
    Final notes: Pros & cons
    Pros:
    ➢ Clear and simple syntax
    ➢ Interactive client
    ➢ Transparent M/R jobs
    ➢ Integration with Java and others
    Cons:
    ➢ Not as flexible as Hadoop
    ➢ Oriented towards ETL, not AI
    ➢ No loops
  • 70.
    Extra information
    ➢ http://pig.apache.org/
    ➢ Programming Pig. Alan Gates. O’Reilly
    ➢ StackOverflow