Intro to Pig UDF

Introduction on writing pig UDFs

Published in: Technology, Business

  • Transcript

    • 1. Introduction to Pig UDFs Chris Wilkes cwilkes@seattlehadoop.org
    • 2. Agenda
        1. What, Why, How
        2. EvalFunc basics
        3. More EvalFunc
        4. LoadFunc
        5. Piggybank
    • 3. Agenda Point 1: What, Why, How
    • 4. What is a UDF? User Defined Function
        • Way to do an operation on a field or fields
        • Note: not on the group
        • Called from within a pig script
        • b = FOREACH a GENERATE foo(color)
        • Currently all done in Java
    • 5. Why use a UDF?
        • You need to do more than grouping or filtering
        • Actually filtering is a UDF
        • Probably using them already
        • Maybe more comfortable in Java land than in SQL / Pig Latin
    • 6. How to write and use?
        • Just extend a class / implement an interface
        • No need for administrator rights, just call your script
        • Very simple Java, just think about your small problem
      Magical Powers not required
    • 7. Moving right along Now to the informative part of the talk
    • 8. Agenda Point 2: EvalFunc basics
    • 9. EvalFunc: probably what you need to do
        • Easiest to understand: takes one or more fields and spits back a generic object
        • Extend the EvalFunc class and it practically writes itself
        • Let’s look at the UPPER example from the piggybank
    • 10. The UPPER EvalFunc (modified version from the piggybank SVN)
        import java.io.IOException;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.PigWarning;
        import org.apache.pig.data.Tuple;

        public class UPPER extends EvalFunc<String> {
            @Override
            public String exec(Tuple input) throws IOException {
                // Nothing to upper-case: return null so the record is skipped
                if (input == null || input.size() == 0 || input.get(0) == null)
                    return null;
                try {
                    return ((String) input.get(0)).toUpperCase();
                } catch (ClassCastException e) {
                    // First field was not a String
                    warn("error msg", PigWarning.UDF_WARNING_1);
                } catch (Exception e) {
                    warn("Error", PigWarning.UDF_WARNING_1);
                }
                return null;
            }
        }
    • 11. The UPPER EvalFunc (same code as slide 10): The generic <String> tells Pig what class will be returned from this method
    • 12. The UPPER EvalFunc (same code as slide 10): The Tuple input contains the fields passed within the parentheses in the script
    • 13. The UPPER EvalFunc (same code as slide 10): Check your inputs for empties or nulls
    • 14. The UPPER EvalFunc (same code as slide 10): You have to know that the 1st parameter inside the tuple is a String
    • 15. The UPPER EvalFunc (same code as slide 10): Catch errors that are acceptable and return null so the record can be skipped over
    • 16. The UPPER EvalFunc: tells Pig what parameters this function takes
        public class UPPER extends EvalFunc<String> {
            public List<FuncSpec> getArgToFuncMapping() {
                List<FuncSpec> funcList = new ArrayList<FuncSpec>();
                // One accepted signature: a single chararray argument
                funcList.add(new FuncSpec(this.getClass().getName(),
                        new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY))));
                return funcList;
            }
        }
    • 17. Recap of UPPER
        • Generics outline the contract for the return type
        • Schemas are preserved (chararray / String)
        • Check inputs for empty or null
        • Return null if the item should be skipped
        • Throw an exception if deadly
        • Name “UPPER” can be used if known to PigContext’s packageImportList, otherwise need the full classname
        • Cast items inside of the Tuple parameter
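The recap's input-handling rules can be tried without a Pig install. The following is only a rough sketch: `UpperSketch` is a made-up name, and a `List<Object>` stands in for Pig's `Tuple` (both are assumptions for illustration, not part of the talk's code). It shows the three outcomes: a value for good input, and null for empty input or a field of the wrong type.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: List<Object> is a stand-in for org.apache.pig.data.Tuple.
final class UpperSketch {

    static String exec(List<Object> input) {
        // Empty or null input: return null so the record is skipped
        if (input == null || input.isEmpty() || input.get(0) == null) {
            return null;
        }
        try {
            return ((String) input.get(0)).toUpperCase();
        } catch (ClassCastException e) {
            // The UDF on the slides calls warn(...) here; the sketch just skips
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(exec(Arrays.<Object>asList("abc")));      // prints "ABC"
        System.out.println(exec(Collections.<Object>emptyList()));   // prints "null"
        System.out.println(exec(Arrays.<Object>asList(42)));         // prints "null"
    }
}
```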
    • 18. Another simple EvalFunc: AstroDist
        • Two input files: planet names with coordinates and pairs of planets
        • Goal: find the distance between the pairs
        • Loading is slightly different: coords in a tuple
        • Input to EvalFunc is a Tuple that contains a Tuple
    • 19. AstroDist input files (image from xkcd.com)
        $ cat src/test/resources/cosmo
        aaa bbb
        aaa ccc
        ddd aaa
        $ cat src/test/resources/planets
        aaa (1,0,10)
        bbb (2,-5,15)
        ccc (-7,12,48)
        ddd (3,3,8)
    • 20. AstroDist pig script
        REGISTER target/pig-demo-1.0-SNAPSHOT.jar;
        planets = load '$dir/planets' as (name : chararray, l:tuple(x : int, y : int, z : int));
        cosmo = load '$dir/cosmo' as (planet1 : chararray, planet2 : chararray);
        A = JOIN cosmo BY planet1, planets BY name;
        B = JOIN A by planet2, planets BY name;
        locations = FOREACH B GENERATE $1 AS p1name:chararray, $2 AS p2name : chararray, AstroDist($3,$5) as distance;
        dump locations;
    • 21. AstroDist output
        $ pig -x local -f src/test/resources/distances.pig -param dir=src/test/resources/
      What B looks like:
        (ddd,aaa,ddd,(3,3,8),aaa,(1,0,10))
        (aaa,bbb,aaa,(1,0,10),bbb,(2,-5,15))
        (aaa,ccc,aaa,(1,0,10),ccc,(-7,12,48))
      Output:
        (aaa,ddd,4.123105625617661)
        (bbb,aaa,7.14142842854285)
        (ccc,aaa,40.64480286580315)
    • 22. AstroDist program
        public class AstroDist extends EvalFunc<Double> {
            @Override
            public Double exec(Tuple input) throws IOException {
                // Each argument is itself a Tuple of (x, y, z)
                Point3D astroPos1 = new Point3D((Tuple) input.get(0));
                Point3D astroPos2 = new Point3D((Tuple) input.get(1));
                return astroPos1.distance(astroPos2);
            }

            @Override
            public List<FuncSpec> getArgToFuncMapping() {
                // Accept exactly two tuple arguments
                Schema s = new Schema();
                s.add(new Schema.FieldSchema(null, DataType.TUPLE));
                s.add(new Schema.FieldSchema(null, DataType.TUPLE));
                return Arrays.asList(
                        new FuncSpec(this.getClass().getName(), s));
            }
        }
    • 23. AstroDist program (cont)
        private static class Point3D {
            private final int x, y, z;

            private Point3D(Tuple tuple) throws ExecException {
                if (tuple.size() != 3) {
                    // ERROR_CODE_BAD_TUPLE is a constant defined elsewhere in the demo code
                    throw new ExecException("Received " + tuple.size()
                            + " points in 3D tuple", ERROR_CODE_BAD_TUPLE, PigException.BUG);
                }
                x = (Integer) tuple.get(0);
                y = (Integer) tuple.get(1);
                z = (Integer) tuple.get(2);
            }

            // Euclidean distance between two points
            private double distance(Point3D other) {
                return Math.sqrt(Math.pow(x - other.x, 2)
                        + Math.pow(y - other.y, 2)
                        + Math.pow(z - other.z, 2));
            }
        }
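As a sanity check on the output shown for AstroDist, the distance arithmetic can be reproduced in plain Java. This is a minimal sketch, not the talk's code: `AstroDistSketch` is a hypothetical name and `List<Integer>` replaces Pig's `Tuple`. For ddd (3,3,8) and aaa (1,0,10) the squared differences are 4 + 9 + 4 = 17, and the square root of 17 is the 4.1231... value in the output.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: List<Integer> is a stand-in for a Pig Tuple of (x, y, z).
final class AstroDistSketch {

    static double distance(List<Integer> a, List<Integer> b) {
        if (a.size() != 3 || b.size() != 3) {
            throw new IllegalArgumentException(
                    "Expected 3 coordinates, got " + a.size() + " and " + b.size());
        }
        double sumOfSquares = 0;
        for (int i = 0; i < 3; i++) {
            double diff = a.get(i) - b.get(i);
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // ddd and aaa from the planets file
        System.out.println(distance(Arrays.asList(3, 3, 8), Arrays.asList(1, 0, 10)));
    }
}
```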
    • 24. Fun times when running this script
        • Looking through PigContext and Main found that /pig.properties in the classpath is parsed for the key/value “udf.import.list”
        • Put this into my jar (src/main/resources with maven) but it didn’t appear to load
        • Debug log should show what’s going on, except debug isn’t turned on till after this load
        • Ended up putting into ~/.pigrc but Pig warns that it should go into conf/pig.properties, a file that isn’t read
        • Schemas and UDFs are picky, use trial and error
    • 25. Agenda Point 3: More EvalFunc
    • 26. Returning a Tuple from a UDF
        • Sometimes you want to return more than one thing from a function
        • For example an expensive calculation was done and its results can be reused
        • But what should be returned?
        • Of course a Tuple
        • “tuple” is the answer 92% of the time
      (Image caption, http://tuplemusic.org/: Tuple is dedicated to exploring and expanding the contemporary repertoire for two bassoons)
    • 27. BestBook: returns the highest scored book
        $ cat src/test/resources/bookscores
        book1 aaa 1
        book1 bbb 3
        book1 ccc 12
        book2 aaa 4
        book2 bbb 1
        book3 ccc 1
        book3 bbb 5
      Want output of that for book3 reviewer bbb was the highest at 5
    • 28. BestBook EvalFunc
        public class BestBook extends EvalFunc<Tuple> {
            @Override
            public Tuple exec(Tuple p_input) throws IOException {
                // Two bags arrive side by side: reviewers and scores
                Iterator<Tuple> bagReviewers = ((DataBag) p_input.get(0)).iterator();
                Iterator<Tuple> bagScores = ((DataBag) p_input.get(1)).iterator();
                int bestScore = -1;
                String bestReviewer = null;
                while (bagReviewers.hasNext() && bagScores.hasNext()) {
                    String reviewerName = (String) bagReviewers.next().get(0);
                    Integer score = (Integer) bagScores.next().get(0);
                    if (score.intValue() > bestScore) {
                        bestScore = score;
                        bestReviewer = reviewerName;
                    }
                }
                return TupleFactory.getInstance().newTuple(
                        Arrays.asList(bestReviewer, (Integer) bestScore));
            }
    • 29. BestBook EvalFunc (same code as slide 28): The inputs are bag “columns”
    • 30. BestBook EvalFunc (same code as slide 28): return a Tuple that’s just like the inputs
    • 31. BestBook EvalFunc: how to define the outbound schema inside the Tuple
        public class BestBook extends EvalFunc<Tuple> {
            @Override
            public Schema outputSchema(Schema p_input) {
                try {
                    // A tuple holding (chararray, int)
                    return Schema.generateNestedSchema(DataType.TUPLE,
                            DataType.CHARARRAY, DataType.INTEGER);
                } catch (FrontendException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
    • 32. BestBook: returns the highest scored book
        REGISTER target/demo-pig-udf-1.0-SNAPSHOT.jar;
        A = LOAD '$dir/bookscores' as (name : chararray, reviewer : chararray, score : int);
        B = group A by name;
        describe B;
        dump B;
        C = FOREACH B GENERATE group, BestBook(A.reviewer, A.score) as reviewandscore;
        describe C;
        dump C;
    • 33. BestBook: returns the highest scored book
        B: {group: chararray,A: {name: chararray,reviewer: chararray,score: int}}
        (book1,{(book1,aaa,1),(book1,bbb,3),(book1,ccc,12)})
        (book2,{(book2,aaa,4),(book2,bbb,1)})
        (book3,{(book3,ccc,1),(book3,bbb,5)})
        C: {group: chararray,reviewandscore: (chararray,int)}
        (book1,(ccc,12))
        (book2,(aaa,4))
        (book3,(bbb,5))
    • 34. BestBook: improve by implementing Algebraic
        • If EvalFunc can be run in stages and summed up consider implementing Algebraic
        • Three methods to override:
        • String getInitial();
        • String getIntermed();
        • String getFinal()
        • See COUNT and DoubleAvg
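Why "best score" fits the staged model can be seen without any Pig classes: the best of several partial bests is the same as the best of everything. The sketch below is a plain-Java illustration under assumptions, not an Algebraic implementation. `BestBookStagesSketch`, `best` and `review` are hypothetical names, and `Map.Entry` stands in for a (reviewer, score) Tuple. In Pig itself the three methods return Strings naming the classes that run each stage; the sketch only mimics the data flow.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Illustration only: Map.Entry stands in for a (reviewer, score) Tuple.
final class BestBookStagesSketch {

    // Reduce step shared by every stage: keep the highest-scored entry.
    // Ties keep the earlier entry, matching the strict ">" in BestBook.
    static Map.Entry<String, Integer> best(List<Map.Entry<String, Integer>> entries) {
        Map.Entry<String, Integer> winner = null;
        for (Map.Entry<String, Integer> e : entries) {
            if (winner == null || e.getValue() > winner.getValue()) {
                winner = e;
            }
        }
        return winner; // null for an empty list
    }

    static Map.Entry<String, Integer> review(String reviewer, int score) {
        return new SimpleEntry<>(reviewer, score);
    }

    public static void main(String[] args) {
        // book1's reviews, split as if two tasks each saw part of the bag
        Map.Entry<String, Integer> partA =
                best(Arrays.asList(review("aaa", 1), review("bbb", 3)));
        Map.Entry<String, Integer> partB =
                best(Arrays.asList(review("ccc", 12)));
        // Final stage: combine the partial winners
        Map.Entry<String, Integer> overall = best(Arrays.asList(partA, partB));
        System.out.println(overall.getKey() + "," + overall.getValue()); // prints "ccc,12"
    }
}
```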
    • 35. FilterFunc: a filter that’s an EvalFunc
        • For keeping and discarding entries write a filter
        • FilterFunc extends EvalFunc<Boolean>
        • Adds a method “void finish()” for cleanup
        • Example: only want dates that are within 10 minutes of one another
    • 36. FilterFunc: DateWithinFilter
        public class DateWithinFilter extends FilterFunc {
            @Override
            public Boolean exec(Tuple input) throws IOException {
                // Expect (start date, end date, max difference in milliseconds)
                if (input.size() != 3) {
                    throw new IOException("error msg");
                }
                Date[] startAndTryDates = getColumnDates(input);
                if (startAndTryDates == null)
                    return false;
                long dateDiff = startAndTryDates[1].getTime()
                        - startAndTryDates[0].getTime();
                if (dateDiff < 0) {
                    return false; // maybe make optional
                }
                int maxDateDiff = (Integer) input.get(2);
                return dateDiff <= maxDateDiff;
            }
    • 37. FilterFunc: DateWithinFilter
            // df is a date format field of the class, not shown on the slide
            private Date[] getColumnDates(Tuple input) throws ExecException {
                String strDate1 = (String) input.get(0);
                String strDate2 = (String) input.get(1);
                if (strDate1 == null || strDate2 == null) {
                    return null;
                }
                Date date1 = null;
                try {
                    date1 = df.parse(strDate1);
                } catch (ParseException e) {
                    warn("date format err", PigWarning.UDF_WARNING_1);
                    return null;
                }
                Date date2 = null;
                try {
                    date2 = df.parse(strDate2);
                } catch (ParseException e) {
                    warn("date format err", PigWarning.UDF_WARNING_1);
                    return null;
                }
                return new Date[] { date1, date2 };
            }
    • 38. FilterFunc: DateWithinFilter
            @Override
            public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
                List<FuncSpec> funcList = new ArrayList<FuncSpec>();
                Schema s = new Schema();
                s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
                s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
                s.add(new Schema.FieldSchema(null, DataType.INTEGER));
                funcList.add(new FuncSpec(this.getClass().getName(), s));
                return funcList;
            }
      Defining what inputs we accept; stay tuned for what happens when violated
    • 39. FilterFunc: DateWithinFilter
        $ cat src/test/resources/purchasetimes
        1234 2010-06-01 10:31:22 2010-06-01 10:32:22
        7121 2010-06-01 10:30:18 2010-06-01 11:02:59
        1234 2010-06-01 10:40:18 2010-06-01 10:45:32
        7681 lol wut
        4532 2010-06-01 11:37:18 2010-06-01 11:42:59
        $ cat src/test/resources/purchasetimes.pig
        REGISTER target/demo-pig-udf-1.0-SNAPSHOT.jar;
        purchasetimes = LOAD '$dir/purchasetimes' AS (userid: int, datein: chararray, dateout: chararray);
        quickybuyers = FILTER purchasetimes BY DateWithinFilter(datein, dateout, 600000);
        DUMP quickybuyers;
        $ pig -x local -f src/test/resources/purchasetimes.pig -param dir=src/test/resources/
        (1234,2010-06-01 10:31:22,2010-06-01 10:32:22)
        (1234,2010-06-01 10:40:18,2010-06-01 10:45:32)
        (4532,2010-06-01 11:37:18,2010-06-01 11:42:59)
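The keep/drop decisions in that output can be checked with a few lines of plain Java. This is a rough sketch under assumptions: `DateWithinSketch` is a hypothetical name, and the pattern `yyyy-MM-dd HH:mm:ss` is a guess based on the sample data, since the slides never show how `df` is built. The time zone is pinned to UTC only so the arithmetic does not depend on the machine running it.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Illustration only: mirrors the filter's decision without any Pig classes.
final class DateWithinSketch {

    static boolean within(String start, String end, int maxMillis) {
        if (start == null || end == null) {
            return false;
        }
        // Assumed pattern, inferred from the purchasetimes sample rows
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            long diff = df.parse(end).getTime() - df.parse(start).getTime();
            // Negative differences are rejected, as in DateWithinFilter
            return diff >= 0 && diff <= maxMillis;
        } catch (ParseException e) {
            // Unparseable rows such as "lol" / "wut" are dropped
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(within("2010-06-01 10:31:22", "2010-06-01 10:32:22", 600000)); // prints "true"
        System.out.println(within("2010-06-01 10:30:18", "2010-06-01 11:02:59", 600000)); // prints "false"
        System.out.println(within("lol", "wut", 600000));                                 // prints "false"
    }
}
```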
    • 40. EvalFunc: not passing in the correct number of args
        $ cat src/test/resources/purchasetimes.pig
        quickybuyers = FILTER purchasetimes BY DateWithinFilter(datein, dateout);
        $ pig -x local -f src/test/resources/purchasetimes.pig -param dir=src/test/resources/
        2010-06-17 17:25:43,440 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: Could not infer the matching function for org.seattlehadoop.demo.pig.udf.DateWithinFilter as multiple or none of them fit. Please use an explicit cast.
        Details at logfile: /Users/cwilkes/Documents/workspace5/SeattleHadoop-demo-code/pig_1276820742917.log
      The log file has:
        at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1197)
      so the error is caught before loading data
    • 41. Agenda Point 4: LoadFunc
    • 42. LoadFunc: definition
        • How does something get loaded into Pig?
        • A = load 'B';
        • But what is actually going on?
        • A = load 'B' using PigStorage();
        • PigStorage is a LoadFunc that reads off of disk and splits on tab to create a Tuple
    • 43. LoadFunc: definition
        • LoadFunc is an interface with a number of methods, the most interesting being
        • bindTo(fileName,inputStream,offset,end)
        • Tuple getNext()
        • Extend from Utf8StorageConverter like PigStorage to get defaults
        • Overview: PigStorage’s getNext() creates an array of objects after splitting on a tab and puts those into a Tuple
    • 44. LoadFunc: make our own
        • Have a lot of log files, some just contain a URL
        • http://example.com?use=mind+bullets&target=yak
        • Want to load URLs and do analysis
        • Write your own LoadFunc to do this that takes a URL and returns a Map of the query parameters
        • Know what parameters you care about, only look for those
    • 45. LoadFunc: make our own (continued)
        • Goal:
        • A = LOAD 'urls' USING QuerystringLoader('query', 'userid') AS (query: chararray, userid : int);
    • 46. LoadFunc: QuerystringLoader
        • Passing in constructor arguments from the pig script is easy:
        • public QuerystringLoader(String... fieldNames)
        • bindTo is almost exactly the same as the PigStorage one, using the PigLineRecordReader to parse the InputStream
        • Tuple getNext() is where the action happens
        • parse the querystring into a Map
        • loop through the fields given in the constructor
        • return a Tuple of a list of those objects
    • 47. LoadFunc: QuerystringLoader getNext()
        @Override
        public Tuple getNext() throws IOException {
            if (in == null || in.getPosition() > end) {
                return null;
            }
            // Read the next line of input
            Text value = new Text();
            boolean notDone = in.next(value);
            if (!notDone) {
                return null;
            }
            Map<String, Object> parameters = getParameterMap(value.toString());
            List<String> output = new ArrayList<String>();
            // Emit one field per name given in the constructor, in order
            for (String fieldName : m_fieldsInOrder) {
                Object object = parameters.get(fieldName);
                if (object == null) {
                    output.add(null);
                    continue;
                }
                if (object instanceof String) {
                    output.add((String) object);
                } else {
                    // Repeated parameter: keep the first value
                    List<String> objectVal = (List<String>) object;
                    output.add(objectVal.get(0));
                }
            }
            return mTupleFactory.newTupleNoCopy(output);
        }
    • 48. LoadFunc: notes
        • boolean okay=in.next(value) is how to get the next line to parse
        • getParameterMap(url) splits querystring into a Map<String,Object>
        • Pig handles type conversion for you, just hand back a Tuple.
        • In this case the Tuple can be made up of anything so user specifies the schema in the script
        • AS (query:chararray, userid:int)
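The slides never show `getParameterMap`, so the following is only a guess at its shape, based on how `getNext()` consumes the result: a single value is stored as a `String`, a repeated parameter as a `List<String>`. `QueryParamsSketch` is a hypothetical class name, and the sketch deliberately does no URL decoding (so `mind+bullets` stays as written); a production loader would probably want that.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: one possible shape for the helper the loader relies on.
final class QueryParamsSketch {

    @SuppressWarnings("unchecked")
    static Map<String, Object> getParameterMap(String url) {
        Map<String, Object> params = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return params; // no querystring at all
        }
        for (String pair : url.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            Object existing = params.get(key);
            if (existing == null) {
                params.put(key, value);                  // first occurrence: plain String
            } else if (existing instanceof String) {
                List<String> values = new ArrayList<>(); // second occurrence: switch to a List
                values.add((String) existing);
                values.add(value);
                params.put(key, values);
            } else {
                ((List<String>) existing).add(value);    // third and later occurrences
            }
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(getParameterMap("http://localhost/foo?a=123&a=456")); // prints "{a=[123, 456]}"
        System.out.println(getParameterMap("http://localhost/bar?a=761&b=hi")); // prints "{a=761, b=hi}"
    }
}
```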
    • 49. RegexLoader: same concept, pass in a Pattern for the constructor and have getNext() return only the matched parts
        @Override
        public Tuple getNext() throws IOException {
            // value holds the current input line; reading it is not shown on the slide
            Matcher m = m_linePattern.matcher(value.toString());
            if (!m.matches()) {
                return EmptyTuple.getInstance();
            }
            // One output field per capturing group
            List<String> regexMatches = new ArrayList<String>();
            for (int i = 1; i <= m.groupCount(); i++) {
                regexMatches.add(m.group(i));
            }
            return mTupleFactory.newTupleNoCopy(regexMatches);
        }
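The group-extraction loop is ordinary `java.util.regex` and can be tried on its own. A minimal sketch, assuming a hypothetical `RegexFieldsSketch` class and returning a `List<String>` where the loader would build a Tuple; the example pattern is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: returns a List where the loader would return a Tuple.
final class RegexFieldsSketch {

    static List<String> fields(Pattern linePattern, String line) {
        Matcher m = linePattern.matcher(line);
        if (!m.matches()) {
            return Collections.emptyList(); // whole line must match, as with matches() on the slide
        }
        List<String> out = new ArrayList<>();
        // Group 0 is the whole match, so capturing groups start at 1
        for (int i = 1; i <= m.groupCount(); i++) {
            out.add(m.group(i));
        }
        return out;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(\\d+) (\\w+)");
        System.out.println(fields(p, "1234 abc")); // prints "[1234, abc]"
        System.out.println(fields(p, "nope"));     // prints "[]"
    }
}
```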
    • 50. Agenda Point 5: Piggybank
    • 51. Piggybank
        • SVN repository of common UDFs
        • Excited about it at first, doesn’t appear to be used that much
        • Needs to be an easier way of doing this
        • CPAN (Perl) for Pig would be great
        • register pigpan://Math::FFT
        • brings down the jars from a maven-like repository and tells pig where to load from
        • any takers? Looking into it
    • 52. Bonus section: unit testing
        @Test
        public void testRepeatQueryParams() throws IOException {
            // Three input lines: a repeated parameter, a line with no querystring, a normal URL
            String url = "http://localhost/foo?a=123&a=456\nx=y\nhttp://localhost/bar?a=761&b=hi";
            QuerystringLoader loader = new QuerystringLoader("a", "b");
            InputStream in = new ByteArrayInputStream(url.getBytes());
            loader.bindTo(null, new BufferedPositionedInputStream(in), 0, url.length());
            Tuple tuple = loader.getNext();
            assertEquals("123", (String) tuple.get(0));
            assertNull(tuple.get(1));
            tuple = loader.getNext();
            assertEquals(2, tuple.size());
            assertNull(tuple.get(0));
            assertNull(tuple.get(1));
            tuple = loader.getNext();
            assertEquals("761", (String) tuple.get(0));
            assertEquals("hi", (String) tuple.get(1));
        }
    • 53. Resources
        • UDF reference: http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html
        • Code samples: http://github.com/seattlehadoop
        • Presentation: http://www.slideshare.net/seattlehadoop
