Intro to Pig UDF
1. Introduction to
Pig UDFs
Chris Wilkes
cwilkes@seattlehadoop.org
2. Agenda
1 What, Why, How
2 EvalFunc basics
3 More EvalFunc
4 LoadFunc
5 Piggybank
3. Agenda Point 1
1 What, Why, How
2 EvalFunc basics
3 More EvalFunc
4 LoadFunc
5 Piggybank
4. What is a UDF?
User Defined Function
• Way to do an operation on a field or fields
• Note: not on the group
• Called from within a pig script
• b = FOREACH a GENERATE foo(color)
• Currently all done in java
5. Why use a UDF?
• You need to do more than grouping or
filtering
• Actually filtering is a UDF
• Probably using them already
• Maybe more comfortable in java land
than in SQL / Pig Latin
6. How to write and use?
• Just extend / implement an
interface
• No need for administrator
rights, just call your script
• Very simple java, just think
about your small problem
Magical Powers not required
8. Agenda
1 What, Why, How
2 EvalFunc basics
3 More EvalFunc
4 LoadFunc
5 Piggybank
9. EvalFunc : probably what you need to do
•Easiest to understand: takes one or more
fields and spits back a generic object
•Extend the EvalFunc class and it
practically writes itself
•Let’s look at the UPPER example from the
piggybank
10. The UPPER EvalFunc
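import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.PigWarning;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Nothing to upper-case: return null so the record is skipped
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        try {
            return ((String) input.get(0)).toUpperCase();
        } catch (ClassCastException e) {
            warn("error msg", PigWarning.UDF_WARNING_1);
        } catch (Exception e) {
            warn("Error", PigWarning.UDF_WARNING_1);
        }
        return null;
    }
}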
modified version from the piggybank SVN
11. The UPPER EvalFunc
The generic <String> tells Pig what class will be
returned from this method
12. The UPPER EvalFunc
The Tuple input contains the fields passed to the function in the script
13. The UPPER EvalFunc
Check your inputs for empties or nulls
14. The UPPER EvalFunc
You have to know that the 1st parameter inside the
tuple is a String
15. The UPPER EvalFunc
Catch errors that are acceptable and return null so
the record can be skipped over
16. The UPPER EvalFunc
public class UPPER extends EvalFunc<String> {
    public List<FuncSpec> getArgToFuncMapping() {
        List<FuncSpec> funcList = new ArrayList<FuncSpec>();
        // One accepted signature: a single chararray argument
        funcList.add(new FuncSpec(this.getClass().getName(),
            new Schema(new Schema.FieldSchema(null,
                DataType.CHARARRAY))));
        return funcList;
    }
}
Tells Pig what parameters this function takes
17. Recap of UPPER
• The generic type parameter sets the contract for the return type
• Schemas are preserved (chararray / String)
• Check inputs for empty or null
• Return null if item should be skipped
• Throw an exception if deadly
• Name “UPPER” can be used if known to
PigContext’s packageImportList, otherwise need
full classname
• Cast items inside of the Tuple parameter
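The recap above can be tried out without Pig on the classpath. The following is only a sketch in plain Java: `UpperSketch` is a hypothetical name, and a `List<Object>` stands in for Pig's `Tuple`, so none of this is the piggybank code itself. It shows the same sequence of steps: check for empty or null input, cast the first field, and return null when the record should be skipped.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the UPPER UDF: a List<Object> plays the role of Pig's Tuple.
class UpperSketch {
    // Returns the upper-cased first field, or null when the record should be skipped.
    static String exec(List<Object> input) {
        // Check inputs for empties or nulls before touching them.
        if (input == null || input.isEmpty() || input.get(0) == null) {
            return null;
        }
        Object first = input.get(0);
        // The function only knows how to handle a String in the first position.
        if (!(first instanceof String)) {
            return null;
        }
        // Locale.ROOT keeps the result independent of the machine's default locale.
        return ((String) first).toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(exec(Arrays.<Object>asList("red")));
        System.out.println(exec(Collections.<Object>emptyList()));
        System.out.println(exec(Arrays.<Object>asList(42)));
    }
}
```

The real UDF logs a warning through `warn(...)` before returning null for a bad cast; the sketch leaves that out because it has no Pig runtime to report to.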
18. Another simple EvalFunc: AstroDist
• Two input files: planet names with coordinates
and pairs of planets
• Goal: find the distance between the pairs
• Loading is slightly different: coords in a tuple
• Input to EvalFunc is a Tuple that contains a Tuple
20. AstroDist pig script
REGISTER target/pig-demo-1.0-SNAPSHOT.jar;
planets = load '$dir/planets' as (name : chararray,
l:tuple(x : int, y : int, z : int));
cosmo = load '$dir/cosmo' as (planet1 : chararray, planet2 : chararray);
A = JOIN cosmo BY planet1, planets BY name;
B = JOIN A by planet2, planets BY name;
locations = FOREACH B GENERATE
$1 AS p1name:chararray,
$2 AS p2name : chararray,
AstroDist($3,$5) as distance;
dump locations;
21. AstroDist output
$ pig -x local -f src/test/resources/distances.pig
-param dir=src/test/resources/
What B looks like:
(ddd,aaa,ddd,(3,3,8),aaa,(1,0,10))
(aaa,bbb,aaa,(1,0,10),bbb,(2,-5,15))
(aaa,ccc,aaa,(1,0,10),ccc,(-7,12,48))
Output:
(aaa,ddd,4.123105625617661)
(bbb,aaa,7.14142842854285)
(ccc,aaa,40.64480286580315)
22. AstroDist program
public class AstroDist extends EvalFunc<Double> {
    @Override
    public Double exec(Tuple input) throws IOException {
        // Each argument is itself a Tuple of (x, y, z)
        Point3D astroPos1 = new Point3D((Tuple) input.get(0));
        Point3D astroPos2 = new Point3D((Tuple) input.get(1));
        return astroPos1.distance(astroPos2);
    }

    @Override
    public List<FuncSpec> getArgToFuncMapping() {
        // Accepts exactly two tuple arguments
        Schema s = new Schema();
        s.add(new Schema.FieldSchema(null, DataType.TUPLE));
        s.add(new Schema.FieldSchema(null, DataType.TUPLE));
        return Arrays.asList(
            new FuncSpec(this.getClass().getName(), s));
    }
}
23. AstroDist program (cont)
private static class Point3D {
    private final int x, y, z;

    private Point3D(Tuple tuple) throws ExecException {
        // ERROR_CODE_BAD_TUPLE is a constant not shown on the slide
        if (tuple.size() != 3) {
            throw new ExecException("Received " + tuple.size() +
                " points in 3D tuple", ERROR_CODE_BAD_TUPLE, PigException.BUG);
        }
        x = (Integer) tuple.get(0);
        y = (Integer) tuple.get(1);
        z = (Integer) tuple.get(2);
    }

    private double distance(Point3D other) {
        return Math.sqrt(Math.pow(x - other.x, 2) +
            Math.pow(y - other.y, 2) + Math.pow(z - other.z, 2));
    }
}
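To check the arithmetic behind the sample output, here is a small Pig-free sketch of the same Euclidean distance. `DistanceSketch` and the use of `int[]` arrays in place of tuples are assumptions made for this illustration only.

```java
// Hypothetical, Pig-free version of the distance calculation used by AstroDist.
class DistanceSketch {
    // Euclidean distance between two points given as {x, y, z} arrays.
    static double distance(int[] a, int[] b) {
        if (a.length != 3 || b.length != 3) {
            throw new IllegalArgumentException("expected 3 coordinates per point");
        }
        double sum = 0;
        for (int i = 0; i < 3; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Coordinates of planets ddd and aaa from the sample data.
        System.out.println(distance(new int[] {3, 3, 8}, new int[] {1, 0, 10}));
    }
}
```

For ddd (3,3,8) and aaa (1,0,10) the squared differences are 4, 9 and 4, so the result is the square root of 17, roughly 4.123, which agrees with the first row of the output slide.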
24. Fun times when running this script
• Looking through PigContext and Main found
that /pig.properties in the classpath is parsed for
the key/value “udf.import.list”
• Put this into my jar (src/main/resources with
maven) but it didn’t appear to load
• Debug log should show what’s going on, except
debug isn’t turned on till after this load
• Ended up putting into ~/.pigrc but Pig warns that
it should go into conf/pig.properties, a file that
isn’t read
• Schemas and UDFs are picky, use trial and error
25. Agenda Point 3
1 What, Why, How
2 EvalFunc basics
3 More EvalFunc
4 LoadFunc
5 Piggybank
26. Returning a Tuple from a UDF
• Sometimes you want to return more than one
thing from a function
• For example an expensive calculation was done
and its results can be reused
• But what should be returned?
• Of course a Tuple
• “tuple” is the answer 92% of the time
http://tuplemusic.org/
Tuple is dedicated to exploring and expanding
the contemporary repertoire for two bassoons
27. BestBook: returns the highest scored book
$ cat src/test/resources/bookscores
book1 aaa 1
book1 bbb 3
book1 ccc 12
book2 aaa 4
book2 bbb 1
book3 ccc 1
book3 bbb 5
Want output showing that for book3, reviewer bbb was the highest at 5
29. BestBook EvalFunc
public class BestBook extends EvalFunc<Tuple> {
    @Override
    public Tuple exec(Tuple p_input) throws IOException {
        // Two bags arrive in parallel: reviewers and their scores
        Iterator<Tuple> bagReviewers =
            ((DataBag) p_input.get(0)).iterator();
        Iterator<Tuple> bagScores =
            ((DataBag) p_input.get(1)).iterator();
        int bestScore = -1;
        String bestReviewer = null;
        while (bagReviewers.hasNext() && bagScores.hasNext()) {
            String reviewerName = (String) bagReviewers.next().get(0);
            Integer score = (Integer) bagScores.next().get(0);
            if (score.intValue() > bestScore) {
                bestScore = score;
                bestReviewer = reviewerName;
            }
        }
        return TupleFactory.getInstance().newTuple(
            Arrays.asList(bestReviewer, (Integer) bestScore));
    }
The inputs are bag “columns”
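The core loop of BestBook can be exercised on its own. The sketch below is an assumption-laden illustration rather than the UDF: `BestScoreSketch` is a hypothetical name, two plain lists stand in for the two bag "columns", and a `Map.Entry` stands in for the returned Tuple.

```java
import java.util.AbstractMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for BestBook: two parallel lists replace the two input bags.
class BestScoreSketch {
    // Walks both lists in step and returns the (reviewer, score) pair with the highest score.
    static Map.Entry<String, Integer> best(List<String> reviewers, List<Integer> scores) {
        Iterator<String> r = reviewers.iterator();
        Iterator<Integer> s = scores.iterator();
        int bestScore = -1;
        String bestReviewer = null;
        while (r.hasNext() && s.hasNext()) {
            String reviewer = r.next();
            int score = s.next();
            if (score > bestScore) {
                bestScore = score;
                bestReviewer = reviewer;
            }
        }
        return new AbstractMap.SimpleEntry<String, Integer>(bestReviewer, bestScore);
    }
}
```

Calling `BestScoreSketch.best(Arrays.asList("ccc", "bbb"), Arrays.asList(1, 5))` with the book3 rows from the sample file yields the pair (bbb, 5). As in the original, empty input gives a null reviewer and a score of -1.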
30. BestBook EvalFunc
Return a Tuple, the same type as the input
31. BestBook EvalFunc
public class BestBook extends EvalFunc<Tuple> {
    @Override
    public Schema outputSchema(Schema p_input) {
        try {
            // A tuple holding (chararray, int)
            return Schema.generateNestedSchema(DataType.TUPLE,
                DataType.CHARARRAY, DataType.INTEGER);
        } catch (FrontendException e) {
            throw new IllegalStateException(e);
        }
    }
}
How to define the outbound
schema inside the Tuple
32. BestBook: returns the highest scored book
REGISTER target/demo-pig-udf-1.0-SNAPSHOT.jar;
A = LOAD '$dir/bookscores' as (name : chararray,
reviewer : chararray, score : int);
B = group A by name;
describe B;
dump B;
C = FOREACH B GENERATE group,
BestBook(A.reviewer, A.score) as reviewandscore;
describe C;
dump C;
34. BestBook: improve by implementing Algebraic
•If EvalFunc can be run in stages and summed up consider
implementing Algebraic
•Three methods to override:
•String getInitial();
•String getIntermed();
•String getFinal()
•See COUNT and DoubleAvg
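What "run in stages and summed up" means can be shown with a highest-score calculation. This sketch is plain Java and does not use Pig's Algebraic interface; `StagedMaxSketch` and its method names are hypothetical. Each chunk is reduced to a partial result first, then the partial results are reduced again, and the answer matches a single pass over all the data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a staged computation; not Pig's Algebraic API.
class StagedMaxSketch {
    // Early stage: reduce one chunk of scores to a partial result.
    static int partialMax(List<Integer> chunk) {
        int max = Integer.MIN_VALUE;
        for (int v : chunk) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Final stage: combine the partial results of all chunks into the overall answer.
    static int finalMax(List<List<Integer>> chunks) {
        List<Integer> partials = new ArrayList<Integer>();
        for (List<Integer> chunk : chunks) {
            partials.add(partialMax(chunk));
        }
        return partialMax(partials);
    }
}
```

Max works this way because the maximum of partial maxima equals the overall maximum. A function without that property, such as a median, cannot be split into stages this simply.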
35. FilterFunc: a filter that’s an EvalFunc
• For keeping and discarding entries write a filter
• FilterFunc extends EvalFunc<Boolean>
• Adds a method “void finish()” for cleanup
• Example: only want dates that are within 10
minutes of one another
36. FilterFunc: DateWithinFilter
public class DateWithinFilter extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input.size() != 3) {
            throw new IOException("error msg");
        }
        // getColumnDates is a helper that is not shown on the slide
        Date[] startAndTryDates = getColumnDates(input);
        if (startAndTryDates == null)
            return false;
        long dateDiff = startAndTryDates[1].getTime() -
            startAndTryDates[0].getTime();
        if (dateDiff < 0) {
            return false; // maybe make optional
        }
        int maxDateDiff = (Integer) input.get(2);
        return dateDiff <= maxDateDiff;
    }
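The comparison at the heart of the filter can be tested in isolation. The sketch below is an assumption for illustration: `DateWithinSketch` is a hypothetical class, and the third argument is taken to be a limit in milliseconds because it is compared directly against a `getTime()` difference.

```java
import java.util.Date;

// Hypothetical, Pig-free version of the check made by DateWithinFilter.
class DateWithinSketch {
    // True when end is at or after start and no more than maxDiffMillis later.
    static boolean within(Date start, Date end, long maxDiffMillis) {
        if (start == null || end == null) {
            return false;
        }
        long diff = end.getTime() - start.getTime();
        if (diff < 0) {
            return false; // end precedes start, rejected as in the slide's code
        }
        return diff <= maxDiffMillis;
    }

    public static void main(String[] args) {
        long tenMinutes = 10L * 60 * 1000;
        System.out.println(within(new Date(0), new Date(5L * 60 * 1000), tenMinutes));
        System.out.println(within(new Date(0), new Date(11L * 60 * 1000), tenMinutes));
    }
}
```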
40. EvalFunc: not passing in the correct number of args
$ cat src/test/resources/purchasetimes.pig
quickybuyers = FILTER purchasetimes BY
DateWithinFilter(datein, dateout);
$ pig -x local -f src/test/resources/purchasetimes.pig -param dir=src/test/resources/
2010-06-17 17:25:43,440 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR
1045: Could not infer the matching function for
org.seattlehadoop.demo.pig.udf.DateWithinFilter as multiple or none of them fit.
Please use an explicit cast.
Details at logfile: /Users/cwilkes/Documents/workspace5/SeattleHadoop-demo-code/pig_1276820742917.log
log file has:
at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1197)
So the error is caught before any data is loaded
41. Agenda
1 What, Why, How
2 EvalFunc basics
3 More EvalFunc
4 LoadFunc
5 Piggybank
42. LoadFunc: definition
• How does something get loaded into Pig?
• A = load 'B';
• But what is actually going on?
• A = load 'B' using PigStorage();
• PigStorage is a LoadFunc that reads off of disk
and splits on tab to create a Tuple
43. LoadFunc: definition
• LoadFunc is an interface with a number of
methods, the most interesting being
• bindTo(fileName,inputStream,offset,end)
• Tuple getNext()
• Extend from Utf8StorageConverter like
PigStorage to get defaults
• Overview: PigStorage’s getNext() creates an array
of objects after splitting on a tab and puts those
into a Tuple
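The split-on-tab step in that overview looks roughly like this in plain Java. This is only a sketch of the idea: `TabSplitSketch` is a hypothetical name, a `List<String>` stands in for the Tuple, and the real PigStorage works differently under the hood.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of turning one tab-delimited line into a list of fields.
class TabSplitSketch {
    static List<String> splitLine(String line) {
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        return Arrays.asList(line.split("\t", -1));
    }

    public static void main(String[] args) {
        System.out.println(splitLine("book1\taaa\t1"));
    }
}
```

Every field comes back as text here; assigning types such as int is left to the schema given in the script.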
45. LoadFunc: make our own
• Have a lot of log files, some just contain a URL
• http://example.com?use=mind+bullets&target=yak
• Want to load URLs and do analysis
• Write your own LoadFunc to do this that takes a
URL and returns a Map of the query parameters
• Know what parameters you care about, only look
for those
• Goal:
• A = LOAD 'urls' USING
QuerystringLoader('query', 'userid') AS (query:
chararray, userid : int);
46. LoadFunc: QuerystringLoader
• Passing in constructor arguments from the pig
script is easy:
• public QuerystringLoader(String... fieldNames)
• bindTo is almost exactly the same as the
PigStorage one, using the PigLineRecordReader
to parse the InputStream
• Tuple getNext() is where the action happens
• parse the querystring into a Map
• loop through the fields given in the constructor
• return a Tuple of a list of those objects
47. LoadFunc: QuerystringLoader getNext()
@Override
public Tuple getNext() throws IOException {
    if (in == null || in.getPosition() > end) {
        return null;
    }
    // Read the next line of input
    Text value = new Text();
    boolean notDone = in.next(value);
    if (!notDone) {
        return null;
    }
    Map<String, Object> parameters = getParameterMap(value.toString());
    List<String> output = new ArrayList<String>();
    // Emit the requested fields in constructor order, null when absent
    for (String fieldName : m_fieldsInOrder) {
        Object object = parameters.get(fieldName);
        if (object == null) {
            output.add(null);
            continue;
        }
        if (object instanceof String) {
            output.add((String) object);
        } else {
            // Repeated parameter: keep the first value
            List<String> objectVal = (List<String>) object;
            output.add(objectVal.get(0));
        }
    }
    return mTupleFactory.newTupleNoCopy(output);
}
48. LoadFunc: notes
• boolean okay=in.next(value) is how to get the next
line of input
• getParameterMap(url) splits querystring into a
Map<String,Object>
• Pig handles type conversion for you, just hand back
a Tuple.
• In this case the Tuple can be made up of anything so
user specifies the schema in the script
• AS (query:chararray, userid:int)
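The slides do not show getParameterMap, so the following is a guess at what such a helper might do, written in plain Java. `QuerystringSketch`, `parse` and `select` are hypothetical names; the sketch keeps repeated parameters as a list and returns the first value of each requested field, or null when the field is missing, mirroring the loop in getNext().

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical querystring helper; not the code behind QuerystringLoader.
class QuerystringSketch {
    // Splits the part after '?' into name -> decoded values, keeping repeats in order.
    static Map<String, List<String>> parse(String url) {
        Map<String, List<String>> params = new LinkedHashMap<String, List<String>>();
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return params; // no querystring at all
        }
        for (String pair : url.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = decode(eq < 0 ? pair : pair.substring(0, eq));
            String value = eq < 0 ? "" : decode(pair.substring(eq + 1));
            List<String> values = params.get(name);
            if (values == null) {
                values = new ArrayList<String>();
                params.put(name, values);
            }
            values.add(value);
        }
        return params;
    }

    // Returns the first value of each requested field, or null when it is absent.
    static List<String> select(String url, String... fieldNames) {
        Map<String, List<String>> params = parse(url);
        List<String> out = new ArrayList<String>();
        for (String fieldName : fieldNames) {
            List<String> values = params.get(fieldName);
            out.add(values == null ? null : values.get(0));
        }
        return out;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8"); // also turns '+' into a space
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With the example URL from the earlier slide, `select("http://example.com?use=mind+bullets&target=yak", "use", "target")` gives the two values "mind bullets" and "yak".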
49. RegexLoader
Same concept, pass in a Pattern for the constructor
and have getNext() return only the matched parts
@Override
public Tuple getNext() throws IOException {
    // value is the line read from the input, as in QuerystringLoader (reading elided here)
    Matcher m = m_linePattern.matcher(value.toString());
    if (!m.matches()) {
        return EmptyTuple.getInstance();
    }
    // Group 0 is the whole match, so start at group 1
    List<String> regexMatches = new ArrayList<String>();
    for (int i = 1; i <= m.groupCount(); i++) {
        regexMatches.add(m.group(i));
    }
    return mTupleFactory.newTupleNoCopy(regexMatches);
}
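The group-extraction loop can be run without Pig. In this sketch `RegexGroupsSketch` is a hypothetical class, the pattern in the usage note is made up for the example, and an empty list stands in for the empty tuple returned on a non-matching line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, Pig-free version of the matching done in RegexLoader.
class RegexGroupsSketch {
    // Returns the capture groups of a fully matching line, or an empty list otherwise.
    static List<String> groups(Pattern linePattern, String line) {
        List<String> out = new ArrayList<String>();
        Matcher m = linePattern.matcher(line);
        if (!m.matches()) {
            return out;
        }
        // Group 0 is the whole match, so the captured parts start at 1.
        for (int i = 1; i <= m.groupCount(); i++) {
            out.add(m.group(i));
        }
        return out;
    }
}
```

For instance, `groups(Pattern.compile("(\\w+)=(\\d+)"), "a=123")` returns the two captured parts "a" and "123". Note that `matches()` requires the whole line to match; `find()` would be the choice for partial matches.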
50. Agenda
1 What, Why, How
2 EvalFunc basics
3 More EvalFunc
4 LoadFunc
5 Piggybank
51. Piggybank
• SVN repository of common UDFs
• Excited about it at first, doesn’t appear to be
used that much
• There needs to be an easier way of doing this
• CPAN (Perl) for Pig would be great
• register pigpan://Math::FFT
• brings down the jars from a maven-like
repository and tells pig where to load from
• any takers? Looking into it
52. Bonus section: unit testing
@Test
public void testRepeatQueryParams() throws IOException {
    // Three input lines: a repeated "a", a line with no querystring, then both fields
    String url = "http://localhost/foo?a=123&a=456\nx=y\n"
        + "http://localhost/bar?a=761&b=hi";
    QuerystringLoader loader = new QuerystringLoader("a", "b");
    InputStream in = new ByteArrayInputStream(url.getBytes());
    loader.bindTo(null,
        new BufferedPositionedInputStream(in), 0, url.length());
    // Line 1: first value of the repeated parameter wins, b is missing
    Tuple tuple = loader.getNext();
    assertEquals("123", (String) tuple.get(0));
    assertNull(tuple.get(1));
    // Line 2: neither field present, still a two-field tuple
    tuple = loader.getNext();
    assertEquals(2, tuple.size());
    assertNull(tuple.get(0));
    assertNull(tuple.get(1));
    // Line 3: both fields present
    tuple = loader.getNext();
    assertEquals("761", (String) tuple.get(0));
    assertEquals("hi", (String) tuple.get(1));
}