Intro to Pig UDF

Introduction to Pig UDFs

Chris Wilkes
cwilkes@seattlehadoop.org

Agenda

1 What, Why, How

2 EvalFunc basics

3 More EvalFunc
4 LoadFunc

5 Piggybank

Agenda Point 1: What, Why, How

What is a UDF?

User Defined Function

• Way to do an operation on a field or fields
• Note: not on the group
• Called from within a pig script
• b = FOREACH a GENERATE foo(color)
• Currently all done in java

Why use a UDF?

• You need to do more than grouping or filtering
• Actually filtering is a UDF
• Probably using them already
• Maybe more comfortable in Java land than in SQL / Pig Latin

How to write and use?

• Just extend / implement an
interface
• No need for administrator
rights, just call your script
• Very simple java, just think
about your small problem

Magical Powers not required

Moving right along

Now to the informative part of the talk

EvalFunc : probably what you need to do

• Easiest to understand: takes one or more fields and spits back a generic object
• Extend the EvalFunc class and it practically writes itself
• Let’s look at the UPPER example from the piggybank

The UPPER EvalFunc

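import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.PigWarning;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // Nothing to upper-case: return null so the record is skipped
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        try {
            return ((String) input.get(0)).toUpperCase();
        } catch (ClassCastException e) {
            // The first field was not a String
            warn("error msg", PigWarning.UDF_WARNING_1);
        } catch (Exception e) {
            warn("Error", PigWarning.UDF_WARNING_1);
        }
        return null;
    }

}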

modified version from the piggybank SVN

The generic <String> tells Pig what class will be returned from this method

The Tuple input contains the fields passed inside the parentheses of the call in the script, e.g. color in foo(color)

Check your inputs for empties or nulls

You have to know that the first parameter inside the tuple is a String

Catch errors that are acceptable and return null so the record can be skipped over

The UPPER EvalFunc

    public List<FuncSpec> getArgToFuncMapping() {
        List<FuncSpec> funcList = new ArrayList<FuncSpec>();
        funcList.add(new FuncSpec(this.getClass().getName(),
                new Schema(new Schema.FieldSchema(null,
                        DataType.CHARARRAY))));
        return funcList;
    }
} // end of class UPPER

Tells Pig what parameters this function takes

Recap of UPPER

• Generics outline the contract for the return type
• Schemas are preserved (chararray / String)
• Check inputs for empty or null
• Return null if item should be skipped
• Throw an exception if deadly
• Name “UPPER” can be used if known to
PigContext’s packageImportList, otherwise need
full classname
• Cast items inside of the Tuple parameter
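
The recap can be tried out without a Pig install. The following is only a sketch in plain Java: a List<Object> stands in for Pig's Tuple, and the class name UpperSketch is made up for illustration. It shows the three outcomes the recap describes: a normal value, a value of the wrong type, and an empty input.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java stand-in for the UPPER logic (assumption: a List<Object>
// plays the role of Pig's Tuple; this is not the piggybank class).
class UpperSketch {

    // Returns the upper-cased first field, or null when the record should be skipped.
    static String exec(List<Object> input) {
        if (input == null || input.isEmpty() || input.get(0) == null) {
            return null; // empty or null input: nothing to do
        }
        try {
            return ((String) input.get(0)).toUpperCase();
        } catch (ClassCastException e) {
            return null; // wrong type: the real UDF also calls warn(...) here
        }
    }

    public static void main(String[] args) {
        System.out.println(exec(Arrays.<Object>asList("red")));    // prints RED
        System.out.println(exec(Arrays.<Object>asList(42)));       // prints null
        System.out.println(exec(Collections.<Object>emptyList())); // prints null
    }
}
```

Returning null rather than throwing is what lets one bad record be skipped without failing the whole job.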

Another simple EvalFunc: AstroDist

• Two input files: planet names with coordinates and pairs of planets
• Goal: find the distance between the pairs
• Loading is slightly different: coords in a tuple
• Input to EvalFunc is a Tuple that contains a Tuple

AstroDist input files

$ cat src/test/resources/cosmo
aaa bbb
aaa ccc
ddd aaa

$ cat src/test/resources/planets
aaa (1,0,10)
bbb (2,-5,15)
ccc (-7,12,48)
ddd (3,3,8)

AstroDist pig script

REGISTER target/pig-demo-1.0-SNAPSHOT.jar;

planets = load '$dir/planets' as (name : chararray,
    l:tuple(x : int, y : int, z : int));
cosmo = load '$dir/cosmo' as (planet1 : chararray, planet2 : chararray);

A = JOIN cosmo BY planet1, planets BY name;
B = JOIN A by planet2, planets BY name;

locations = FOREACH B GENERATE
    $1 AS p1name:chararray,
    $2 AS p2name : chararray,
    AstroDist($3,$5) as distance;

dump locations;

AstroDist output

$ pig -x local -f src/test/resources/distances.pig
-param dir=src/test/resources/

What B looks like:
(ddd,aaa,ddd,(3,3,8),aaa,(1,0,10))
(aaa,bbb,aaa,(1,0,10),bbb,(2,-5,15))
(aaa,ccc,aaa,(1,0,10),ccc,(-7,12,48))

Output:
(aaa,ddd,4.123105625617661)
(bbb,aaa,7.14142842854285)
(ccc,aaa,40.64480286580315)
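
The distances in the output can be checked by hand with the same Euclidean formula the UDF uses. A small plain-Java sketch follows; the class name DistanceCheck and the int[] representation are assumptions for illustration, not part of the demo code.

```java
// Recomputes the three distances from the planets file without Pig.
class DistanceCheck {

    // Euclidean distance between two 3D integer points
    static double distance(int[] a, int[] b) {
        double sum = 0;
        for (int i = 0; i < 3; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        int[] aaa = {1, 0, 10};
        int[] bbb = {2, -5, 15};
        int[] ccc = {-7, 12, 48};
        int[] ddd = {3, 3, 8};
        System.out.println(distance(ddd, aaa)); // sqrt(4 + 9 + 4) = sqrt(17)
        System.out.println(distance(aaa, bbb)); // sqrt(1 + 25 + 25) = sqrt(51)
        System.out.println(distance(aaa, ccc)); // sqrt(64 + 144 + 1444) = sqrt(1652)
    }
}
```

The three values should agree with the distance column of the dump above.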

AstroDist program

public class AstroDist extends EvalFunc<Double> {
    @Override
    public Double exec(Tuple input) throws IOException {
        Point3D astroPos1 = new Point3D((Tuple) input.get(0));
        Point3D astroPos2 = new Point3D((Tuple) input.get(1));
        return astroPos1.distance(astroPos2);
    }

    @Override
    public List<FuncSpec> getArgToFuncMapping() {
        Schema s = new Schema();
        s.add(new Schema.FieldSchema(null, DataType.TUPLE));
        s.add(new Schema.FieldSchema(null, DataType.TUPLE));
        return Arrays.asList(
                new FuncSpec(this.getClass().getName(), s));
    }
}

AstroDist program (cont)

private static class Point3D {
    private final int x, y, z;

    private Point3D(Tuple tuple) throws ExecException {
        if (tuple.size() != 3) {
            throw new ExecException("Received " + tuple.size() +
                    " points in 3D tuple", ERROR_CODE_BAD_TUPLE, PigException.BUG);
        }
        x = (Integer) tuple.get(0);
        y = (Integer) tuple.get(1);
        z = (Integer) tuple.get(2);
    }

    private double distance(Point3D other) {
        return Math.sqrt(Math.pow(x - other.x, 2) +
                Math.pow(y - other.y, 2) + Math.pow(z - other.z, 2));
    }
}

Fun times when running this script

• Looking through PigContext and Main found
that /pig.properties in the classpath is parsed for
the key/value “udf.import.list”
• Put this into my jar (src/main/resources with
maven) but it didn’t appear to load
• Debug log should show what’s going on, except
debug isn’t turned on till after this load
• Ended up putting into ~/.pigrc but Pig warns that it should go into conf/pig.properties, a file that isn’t read
• Schemas and UDFs are picky, use trial and error

Agenda Point 3: More EvalFunc

Returning a Tuple from a UDF

• Sometimes you want to return more than one
thing from a function
• For example an expensive calculation was done
and its results can be reused
• But what should be returned?
• Of course a Tuple
• “tuple” is the answer 92% of the time

http://tuplemusic.org/
Tuple is dedicated to exploring and expanding
the contemporary repertoire for two bassoons

BestBook: returns the highest scored book

$ cat src/test/resources/bookscores
book1 aaa 1
book1 bbb 3
book1 ccc 12
book2 aaa 4
book2 bbb 1
book3 ccc 1
book3 bbb 5

Want output saying that for book3, reviewer bbb was the highest at 5

BestBook EvalFunc

public class BestBook extends EvalFunc<Tuple> {

    @Override
    public Tuple exec(Tuple p_input) throws IOException {
        // Each argument arrives as a bag holding one "column" of the group
        Iterator<Tuple> bagReviewers =
                ((DataBag) p_input.get(0)).iterator();
        Iterator<Tuple> bagScores =
                ((DataBag) p_input.get(1)).iterator();
        int bestScore = -1;
        String bestReviewer = null;
        // Walk both bags in step and remember the highest score seen
        while (bagReviewers.hasNext() && bagScores.hasNext()) {
            String reviewerName = (String) bagReviewers.next().get(0);
            Integer score = (Integer) bagScores.next().get(0);
            if (score.intValue() > bestScore) {
                bestScore = score;
                bestReviewer = reviewerName;
            }
        }
        return TupleFactory.getInstance().newTuple(
                Arrays.asList(bestReviewer, (Integer) bestScore));
    }

The inputs are bag “columns”

Return a Tuple that’s just like the inputs

BestBook EvalFunc


    @Override
    public Schema outputSchema(Schema p_input) {
        try {
            return Schema.generateNestedSchema(DataType.TUPLE,
                    DataType.CHARARRAY, DataType.INTEGER);
        } catch (FrontendException e) {
            throw new IllegalStateException(e);
        }
    }
}

How to define the outbound schema inside the Tuple


REGISTER target/demo-pig-udf-1.0-SNAPSHOT.jar;

A = LOAD '$dir/bookscores' as (name : chararray,
reviewer : chararray, score : int);

B = group A by name;
describe B;
dump B;

C = FOREACH B GENERATE group,
BestBook(A.reviewer, A.score) as reviewandscore;

describe C;
dump C;


B: {group: chararray,A: {name: chararray,reviewer: chararray,score: int}}
(book1,{(book1,aaa,1),(book1,bbb,3),(book1,ccc,12)})
(book2,{(book2,aaa,4),(book2,bbb,1)})
(book3,{(book3,ccc,1),(book3,bbb,5)})

C: {group: chararray,reviewandscore: (chararray,int)}
(book1,(ccc,12))
(book2,(aaa,4))
(book3,(bbb,5))

BestBook: improve by implementing Algebraic

• If an EvalFunc can be run in stages and its partial results combined, consider implementing Algebraic
• Three methods to implement:
• String getInitial();
• String getIntermed();
• String getFinal();
• See COUNT and DoubleAvg
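
Why BestBook fits the Algebraic pattern: the best of several partial bests is the overall best, so the work can be split across stages. The sketch below is plain Java, not the Pig interface itself; the names BestScoreStages and Scored are made up for illustration. In Pig, each of the three methods returns the name of a class that handles one stage; here a single best() method plays every stage.

```java
import java.util.Arrays;
import java.util.List;

// Illustrates staged computation of "highest score" without Pig.
class BestScoreStages {

    // A (reviewer, score) pair; hypothetical helper type for this sketch.
    static final class Scored {
        final String reviewer;
        final int score;

        Scored(String reviewer, int score) {
            this.reviewer = reviewer;
            this.score = score;
        }
    }

    // Picks the highest-scoring entry. Usable on raw rows (initial stage)
    // and on partial winners (intermediate and final stages) alike.
    static Scored best(List<Scored> candidates) {
        Scored winner = null;
        for (Scored c : candidates) {
            if (winner == null || c.score > winner.score) {
                winner = c;
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        // book1's reviews, split as if two tasks each saw part of the bag
        Scored fromTask1 = best(Arrays.asList(new Scored("aaa", 1), new Scored("bbb", 3)));
        Scored fromTask2 = best(Arrays.asList(new Scored("ccc", 12)));
        Scored overall = best(Arrays.asList(fromTask1, fromTask2));
        System.out.println(overall.reviewer + " " + overall.score); // prints ccc 12
    }
}
```

The result matches the (ccc,12) row for book1 in the dump of C, whichever way the rows are split.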

FilterFunc: a filter that’s an EvalFunc

• For keeping and discarding entries write a filter
• FilterFunc extends EvalFunc<Boolean>
• Adds a method “void finish()” for cleanup
• Example: only want dates that are within 10 minutes of one another

FilterFunc: DateWithinFilter

public class DateWithinFilter extends FilterFunc {

    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input.size() != 3) {
            throw new IOException("error msg");
        }
        Date[] startAndTryDates = getColumnDates(input);
        if (startAndTryDates == null)
            return false;
        long dateDiff = startAndTryDates[1].getTime() -
                startAndTryDates[0].getTime();
        if (dateDiff < 0) {
            return false; // maybe make optional
        }
        int maxDateDiff = (Integer) input.get(2);
        return dateDiff <= maxDateDiff;
    }

    // df is a date format field of this class; its declaration is not shown on the slide
    private Date[] getColumnDates(Tuple input) throws ExecException {
        String strDate1 = (String) input.get(0);
        String strDate2 = (String) input.get(1);
        if (strDate1 == null || strDate2 == null) {
            return null;
        }
        Date date1 = null;
        try {
            date1 = df.parse(strDate1);
        } catch (ParseException e) {
            warn("date format err", PigWarning.UDF_WARNING_1);
            return null;
        }
        Date date2 = null;
        try {
            date2 = df.parse(strDate2);
        } catch (ParseException e) {
            warn("date format err", PigWarning.UDF_WARNING_1);
            return null;
        }
        return new Date[] { date1, date2 };
    }


    @Override
    public List<FuncSpec> getArgToFuncMapping() throws
            FrontendException {
        List<FuncSpec> funcList = new ArrayList<FuncSpec>();
        Schema s = new Schema();
        s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
        s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
        s.add(new Schema.FieldSchema(null, DataType.INTEGER));
        funcList.add(new FuncSpec(this.getClass().getName(), s));
        return funcList;
    }
} // end of class DateWithinFilter

Defining what inputs we accept
Stay tuned for what happens when violated

$ cat src/test/resources/purchasetimes
1234 2010-06-01 10:31:22 2010-06-01 10:32:22
7121 2010-06-01 10:30:18 2010-06-01 11:02:59
1234 2010-06-01 10:40:18 2010-06-01 10:45:32
7681 lol wut
4532 2010-06-01 11:37:18 2010-06-01 11:42:59

$ cat src/test/resources/purchasetimes.pig
REGISTER target/demo-pig-udf-1.0-SNAPSHOT.jar;
purchasetimes = LOAD '$dir/purchasetimes' AS
(userid: int, datein: chararray, dateout: chararray);
quickybuyers = FILTER purchasetimes BY
DateWithinFilter(datein, dateout, 600000);
DUMP quickybuyers;

$ pig -x local -f src/test/resources/purchasetimes.pig
-param dir=src/test/resources/
(1234,2010-06-01 10:31:22,2010-06-01 10:32:22)
(1234,2010-06-01 10:40:18,2010-06-01 10:45:32)
(4532,2010-06-01 11:37:18,2010-06-01 11:42:59)
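
To see why those three rows survive, the filter's core test can be run outside Pig. This is a sketch only: the class name DateWithinCheck is made up, and the pattern "yyyy-MM-dd HH:mm:ss" and the UTC time zone are assumptions inferred from the sample file, since the slide does not show how the filter's date format is set up.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Plain-Java version of the "within N milliseconds" test.
class DateWithinCheck {

    // Assumed timestamp layout, inferred from the purchasetimes sample
    static final String PATTERN = "yyyy-MM-dd HH:mm:ss";

    static boolean within(String start, String end, long maxMillis) {
        SimpleDateFormat df = new SimpleDateFormat(PATTERN);
        df.setTimeZone(TimeZone.getTimeZone("UTC")); // fixed zone keeps the diff stable
        try {
            long diff = df.parse(end).getTime() - df.parse(start).getTime();
            // Negative gaps are rejected, matching the filter's behaviour
            return diff >= 0 && diff <= maxMillis;
        } catch (ParseException e) {
            return false; // unparseable values such as "lol" are dropped
        }
    }

    public static void main(String[] args) {
        long tenMinutes = 600000L;
        System.out.println(within("2010-06-01 10:31:22", "2010-06-01 10:32:22", tenMinutes)); // true
        System.out.println(within("2010-06-01 10:30:18", "2010-06-01 11:02:59", tenMinutes)); // false
        System.out.println(within("lol", "wut", tenMinutes));                                 // false
    }
}
```

The 7121 row is dropped because its gap is over 32 minutes, and the 7681 row because neither value parses as a date.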

EvalFunc: not passing in the correct number of args
$ cat src/test/resources/purchasetimes.pig
quickybuyers = FILTER purchasetimes BY
DateWithinFilter(datein, dateout);

$ pig -x local -f src/test/resources/purchasetimes.pig -param dir=src/test/resources/

2010-06-17 17:25:43,440 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: Could not infer the matching function for org.seattlehadoop.demo.pig.udf.DateWithinFilter as multiple or none of them fit. Please use an explicit cast.
Details at logfile: /Users/cwilkes/Documents/workspace5/SeattleHadoop-demo-code/pig_1276820742917.log

log file has:
at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1197)
so the error is caught before loading data

LoadFunc: definition

• How does something get loaded into Pig?
• A = load 'B';
• But what is actually going on?
• A = load 'B' using PigStorage();
• PigStorage is a LoadFunc that reads off of disk and splits on tab to create a Tuple

LoadFunc: definition

• LoadFunc is an interface with a number of methods, the most interesting being
• bindTo(fileName,inputStream,offset,end)
• Tuple getNext()
• Extend from Utf8StorageConverter like PigStorage to get defaults
• Overview: PigStorage’s getNext() creates an array of objects after splitting on a tab and puts those into a Tuple

LoadFunc: make our own

• Have a lot of log files, some just contain a URL
• http://example.com?use=mind+bullets&target=yak
• Want to load URLs and do analysis
• Write your own LoadFunc to do this that takes a
URL and returns a Map of the query parameters
• Know what parameters you care about, only look
for those

• Goal:
• A = LOAD 'urls' USING
QuerystringLoader('query', 'userid') AS (query:
chararray, userid : int);

LoadFunc: QuerystringLoader

• Passing in constructor arguments from the pig script is easy:
• public QuerystringLoader(String... fieldNames)
• bindTo is almost exactly the same as the PigStorage one, using the PigLineRecordReader to parse the InputStream
• Tuple getNext() is where the action happens
• parse the querystring into a Map
• loop through the fields given in the constructor
• return a Tuple of a list of those objects

LoadFunc: QuerystringLoader getNext()

@Override
public Tuple getNext() throws IOException {
    if (in == null || in.getPosition() > end) {
        return null;
    }
    Text value = new Text();
    boolean notDone = in.next(value);
    if (!notDone) {
        return null;
    }
    Map<String, Object> parameters = getParameterMap(value.toString());
    List<String> output = new ArrayList<String>();
    // Emit one field per name given in the constructor, in that order
    for (String fieldName : m_fieldsInOrder) {
        Object object = parameters.get(fieldName);
        if (object == null) {
            output.add(null);
            continue;
        }
        if (object instanceof String) {
            output.add((String) object);
        } else {
            // Repeated parameter: keep only the first value
            List<String> objectVal = (List<String>) object;
            output.add(objectVal.get(0));
        }
    }
    return mTupleFactory.newTupleNoCopy(output);
}

LoadFunc: notes

• boolean notDone = in.next(value) is how to get the next line of input
• getParameterMap(url) splits querystring into a Map<String,Object>
• Pig handles type conversion for you, just hand back a Tuple.
• In this case the Tuple can be made up of anything so the user specifies the schema in the script
• AS (query:chararray, userid:int)
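
The slides do not show getParameterMap itself. One way it could work, given how getNext() consumes the result (a String for a single value, a List<String> for a repeated key), is sketched below. The class name QueryStringSketch and every detail of the body are assumptions; URL decoding is left out, so a value such as mind+bullets is returned as written.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the loader's getParameterMap helper.
class QueryStringSketch {

    // A key seen once maps to a String; a repeated key maps to a
    // List<String> holding its values in the order they appeared.
    @SuppressWarnings("unchecked")
    static Map<String, Object> getParameterMap(String url) {
        Map<String, Object> params = new HashMap<String, Object>();
        int q = url.indexOf('?');
        if (q < 0) {
            return params; // no query string at all
        }
        for (String pair : url.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip fragments with no key or no '='
            }
            String key = pair.substring(0, eq);
            String value = pair.substring(eq + 1);
            Object existing = params.get(key);
            if (existing == null) {
                params.put(key, value);
            } else if (existing instanceof String) {
                // Second occurrence: promote the single value to a list
                List<String> values = new ArrayList<String>();
                values.add((String) existing);
                values.add(value);
                params.put(key, values);
            } else {
                ((List<String>) existing).add(value);
            }
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(getParameterMap("http://localhost/foo?a=123&a=456").get("a"));
        System.out.println(getParameterMap("http://example.com?use=mind+bullets&target=yak").get("target"));
    }
}
```

A line with no query string yields an empty map, so every requested field comes back null.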

RegexLoader

Same concept, pass in a Pattern for the constructor
and have getNext() return only the matched parts

@Override
public Tuple getNext() throws IOException {
    // (reading the next line into value is not shown on the slide)
    Matcher m = m_linePattern.matcher(value.toString());
    if (!m.matches()) {
        return EmptyTuple.getInstance();
    }
    List<String> regexMatches = new ArrayList<String>();
    for (int i = 1; i <= m.groupCount(); i++) {
        regexMatches.add(m.group(i));
    }
    return mTupleFactory.newTupleNoCopy(regexMatches);
}
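
The group-extraction loop can be exercised on its own with java.util.regex. This is a plain-Java sketch with a made-up class name (RegexGroupsSketch) and a made-up sample pattern; an empty list stands in for the loader's empty tuple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts the capture groups of a line, as the RegexLoader loop does.
class RegexGroupsSketch {

    static List<String> groups(Pattern linePattern, String line) {
        List<String> out = new ArrayList<String>();
        Matcher m = linePattern.matcher(line);
        if (!m.matches()) {
            return out; // matches() requires the whole line to fit the pattern
        }
        // Group 0 is the whole match, so the capture groups start at 1
        for (int i = 1; i <= m.groupCount(); i++) {
            out.add(m.group(i));
        }
        return out;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(\\d+)\\s+(\\w+)");
        System.out.println(groups(p, "1234 buy")); // prints [1234, buy]
        System.out.println(groups(p, "lol"));      // prints []
    }
}
```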

Piggybank

• SVN repository of common UDFs
• Excited about it at first, doesn’t appear to be used that much
• There needs to be an easier way of doing this
• A CPAN (Perl) for Pig would be great
• register pigpan://Math::FFT
• brings down the jars from a maven-like repository and tells pig where to load from
• any takers? Looking into it

Bonus section: unit testing
@Test
public void testRepeatQueryParams() throws IOException {
    // Three input lines: a repeated parameter, a line with no URL, a normal URL
    String url = "http://localhost/foo?a=123&a=456\nx=y\n"
            + "http://localhost/bar?a=761&b=hi";
    QuerystringLoader loader = new QuerystringLoader("a", "b");
    InputStream in = new ByteArrayInputStream(url.getBytes());
    loader.bindTo(null,
            new BufferedPositionedInputStream(in), 0, url.length());

    Tuple tuple = loader.getNext();
    assertEquals("123", (String) tuple.get(0));
    assertNull(tuple.get(1));

    tuple = loader.getNext();
    assertEquals(2, tuple.size());

    tuple = loader.getNext();
    assertEquals("761", (String) tuple.get(0));
    assertEquals("hi", (String) tuple.get(1));
}

Resources

UDF reference:
http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html

Code samples:
http://github.com/seattlehadoop

Presentation:
http://www.slideshare.net/seattlehadoop
