Intro to Pig UDF
 

An introduction to writing Pig UDFs

Intro to Pig UDF: Presentation Transcript

  • Introduction to Pig UDFs, by Chris Wilkes (cwilkes@seattlehadoop.org)
  • Agenda
    1. What, Why, How
    2. EvalFunc basics
    3. More EvalFunc
    4. LoadFunc
    5. Piggybank
  • Agenda Point 1: What, Why, How
  • What is a UDF?
    • A User Defined Function
    • A way to perform an operation on a field or fields
    • Note: not on the group
    • Called from within a Pig script
    • b = FOREACH a GENERATE foo(color);
    • Currently all written in Java
  • Why use a UDF?
    • You need to do more than grouping or filtering
    • Actually, filtering is a UDF
    • You are probably using them already
    • You may be more comfortable in Java land than in SQL / Pig Latin
  • How to write and use one?
    • Just extend a class / implement an interface
    • No need for administrator rights, just call it from your script
    • Very simple Java, just think about your small problem
    • Magical powers not required
  • Moving right along Now to the informative part of the talk
  • Agenda Point 2: EvalFunc basics
  • EvalFunc: probably what you need
    • Easiest to understand: takes one or more fields and spits back a generic object
    • Extend the EvalFunc class and it practically writes itself
    • Let's look at the UPPER example from the piggybank
  • The UPPER EvalFunc (modified version from the piggybank SVN)
        import java.io.IOException;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.PigWarning;
        import org.apache.pig.data.Tuple;

        public class UPPER extends EvalFunc<String> {
            @Override
            public String exec(Tuple input) throws IOException {
                // Check the input for empties or nulls before using it.
                if (input == null || input.size() == 0 || input.get(0) == null)
                    return null;
                try {
                    // The first field is expected to be a String.
                    return ((String) input.get(0)).toUpperCase();
                } catch (ClassCastException e) {
                    warn("error msg", PigWarning.UDF_WARNING_1);
                } catch (Exception e) {
                    warn("Error", PigWarning.UDF_WARNING_1);
                }
                // Acceptable errors return null so the row can be skipped.
                return null;
            }
        }
    • The generic <String> tells Pig what class will be returned from this method
    • The Tuple input contains the fields passed inside the () in the script
    • Check your inputs for empties or nulls
    • You have to know that the 1st parameter inside the tuple is a String
    • Catch errors that are acceptable and return null so the row can be skipped over
  • The UPPER EvalFunc: getArgToFuncMapping() tells Pig what parameters this function takes
        public class UPPER extends EvalFunc<String> {
            public List<FuncSpec> getArgToFuncMapping() {
                List<FuncSpec> funcList = new ArrayList<FuncSpec>();
                // One chararray argument.
                funcList.add(new FuncSpec(this.getClass().getName(),
                        new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY))));
                return funcList;
            }
        }
  • Recap of UPPER
    • The generic type parameter defines the contract for the return type
    • Schemas are preserved (chararray / String)
    • Check inputs for empty or null
    • Return null if the item should be skipped
    • Throw an exception if the error is deadly
    • The name "UPPER" can be used if the package is known to PigContext's packageImportList, otherwise the full class name is needed
    • Cast the items inside the Tuple parameter
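The recap can be tried out without a Pig installation. What follows is a minimal sketch, assuming a plain java.util.List stands in for Pig's Tuple; UpperSketch and its exec method are hypothetical names used only for illustration and are not part of Pig.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the UPPER UDF: a List<Object> plays the role of Pig's Tuple.
final class UpperSketch {
    private UpperSketch() {}

    // Returns the upper-cased first field, or null when the row should be skipped.
    static String exec(List<Object> input) {
        // Check inputs for empties or nulls before touching them.
        if (input == null || input.isEmpty() || input.get(0) == null) {
            return null;
        }
        try {
            // The caller has to know that the first field is a String, so cast it.
            return ((String) input.get(0)).toUpperCase();
        } catch (ClassCastException e) {
            // An acceptable error: report it and skip the row (a real UDF would call warn()).
            System.err.println("first field is not a String: " + input.get(0));
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(exec(Arrays.<Object>asList("red")));
        System.out.println(exec(Arrays.<Object>asList(42)));
    }
}
```

The second call passes an Integer, so the cast fails and the sketch returns null instead of aborting, which mirrors the "return null if the item should be skipped" rule.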
  • Another simple EvalFunc: AstroDist
    • Two input files: planet names with coordinates, and pairs of planets
    • Goal: find the distance between the pairs
    • Loading is slightly different: the coordinates are in a tuple
    • Input to the EvalFunc is a Tuple that contains a Tuple
  • AstroDist input files (slide image from xkcd.com)
        $ cat src/test/resources/cosmo
        aaa bbb
        aaa ccc
        ddd aaa
        $ cat src/test/resources/planets
        aaa (1,0,10)
        bbb (2,-5,15)
        ccc (-7,12,48)
        ddd (3,3,8)
  • AstroDist pig script
        REGISTER target/pig-demo-1.0-SNAPSHOT.jar;
        planets = load '$dir/planets' as (name : chararray, l:tuple(x : int, y : int, z : int));
        cosmo = load '$dir/cosmo' as (planet1 : chararray, planet2 : chararray);
        A = JOIN cosmo BY planet1, planets BY name;
        B = JOIN A by planet2, planets BY name;
        locations = FOREACH B GENERATE $1 AS p1name:chararray, $2 AS p2name : chararray, AstroDist($3,$5) as distance;
        dump locations;
  • AstroDist output
        $ pig -x local -f src/test/resources/distances.pig -param dir=src/test/resources/
    • What B looks like:
        (ddd,aaa,ddd,(3,3,8),aaa,(1,0,10))
        (aaa,bbb,aaa,(1,0,10),bbb,(2,-5,15))
        (aaa,ccc,aaa,(1,0,10),ccc,(-7,12,48))
    • Output:
        (aaa,ddd,4.123105625617661)
        (bbb,aaa,7.14142842854285)
        (ccc,aaa,40.64480286580315)
  • AstroDist program
        public class AstroDist extends EvalFunc<Double> {
            @Override
            public Double exec(Tuple input) throws IOException {
                // Each argument is itself a Tuple holding (x, y, z).
                Point3D astroPos1 = new Point3D((Tuple) input.get(0));
                Point3D astroPos2 = new Point3D((Tuple) input.get(1));
                return astroPos1.distance(astroPos2);
            }

            @Override
            public List<FuncSpec> getArgToFuncMapping() {
                Schema s = new Schema();
                s.add(new Schema.FieldSchema(null, DataType.TUPLE));
                s.add(new Schema.FieldSchema(null, DataType.TUPLE));
                return Arrays.asList(new FuncSpec(this.getClass().getName(), s));
            }
        }
  • AstroDist program (cont)
        private static class Point3D {
            private final int x, y, z;

            private Point3D(Tuple tuple) throws ExecException {
                if (tuple.size() != 3) {
                    throw new ExecException("Received " + tuple.size() + " points in 3D tuple",
                            ERROR_CODE_BAD_TUPLE, PigException.BUG);
                }
                x = (Integer) tuple.get(0);
                y = (Integer) tuple.get(1);
                z = (Integer) tuple.get(2);
            }

            // Euclidean distance to the other point.
            private double distance(Point3D other) {
                return Math.sqrt(Math.pow(x - other.x, 2)
                        + Math.pow(y - other.y, 2)
                        + Math.pow(z - other.z, 2));
            }
        }
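The distance arithmetic can be checked on its own. The sketch below is a standalone version of the formula, assuming plain int arrays in place of Pig tuples; DistanceSketch is a hypothetical name and no Pig classes are involved.

```java
// Hypothetical standalone check of the distance formula used by Point3D.
final class DistanceSketch {
    private DistanceSketch() {}

    // Euclidean distance between two points given as {x, y, z} arrays.
    static double distance(int[] a, int[] b) {
        if (a.length != 3 || b.length != 3) {
            throw new IllegalArgumentException("expected 3 coordinates per point");
        }
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    public static void main(String[] args) {
        int[] aaa = {1, 0, 10};
        int[] ddd = {3, 3, 8};
        // Squared differences are 4 + 9 + 4 = 17, so this prints the square root of 17.
        System.out.println(distance(ddd, aaa));
    }
}
```

For the ddd/aaa pair the result is the square root of 17, which agrees with the 4.123105625617661 shown in the output slide.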
  • Fun times when running this script
    • Looking through PigContext and Main, I found that /pig.properties on the classpath is parsed for the key "udf.import.list"
    • Put this file into my jar (src/main/resources with Maven), but it didn't appear to load
    • The debug log should show what's going on, except that debug isn't turned on until after this load
    • Ended up putting the setting into ~/.pigrc, but Pig warns that it should go into conf/pig.properties, a file that isn't read
    • Schemas and UDFs are picky; use trial and error
  • Agenda Point 3: More EvalFunc
  • Returning a Tuple from a UDF
    • Sometimes you want to return more than one thing from a function
    • For example, an expensive calculation was done and its results can be reused
    • But what should be returned?
    • A Tuple, of course
    • "tuple" is the answer 92% of the time
    • Aside on the slide: http://tuplemusic.org/ "Tuple is dedicated to exploring and expanding the contemporary repertoire for two bassoons"
  • BestBook: returns the highest scored book
        $ cat src/test/resources/bookscores
        book1 aaa 1
        book1 bbb 3
        book1 ccc 12
        book2 aaa 4
        book2 bbb 1
        book3 ccc 1
        book3 bbb 5
    • Wanted output: for book3, reviewer bbb was the highest at 5
  • BestBook EvalFunc
        public class BestBook extends EvalFunc<Tuple> {
            @Override
            public Tuple exec(Tuple p_input) throws IOException {
                // Each input is a bag "column": one bag of reviewers, one bag of scores.
                Iterator<Tuple> bagReviewers = ((DataBag) p_input.get(0)).iterator();
                Iterator<Tuple> bagScores = ((DataBag) p_input.get(1)).iterator();
                int bestScore = -1;
                String bestReviewer = null;
                while (bagReviewers.hasNext() && bagScores.hasNext()) {
                    String reviewerName = (String) bagReviewers.next().get(0);
                    Integer score = (Integer) bagScores.next().get(0);
                    if (score.intValue() > bestScore) {
                        bestScore = score;
                        bestReviewer = reviewerName;
                    }
                }
                return TupleFactory.getInstance().newTuple(
                        Arrays.asList(bestReviewer, (Integer) bestScore));
            }
        }
    • The inputs are bag "columns"
    • Return a Tuple that's just like the inputs
  • BestBook EvalFunc: how to define the outbound schema inside the Tuple
        public class BestBook extends EvalFunc<Tuple> {
            @Override
            public Schema outputSchema(Schema p_input) {
                try {
                    // A tuple containing (chararray, int).
                    return Schema.generateNestedSchema(DataType.TUPLE,
                            DataType.CHARARRAY, DataType.INTEGER);
                } catch (FrontendException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
  • BestBook: the pig script
        REGISTER target/demo-pig-udf-1.0-SNAPSHOT.jar;
        A = LOAD '$dir/bookscores' as (name : chararray, reviewer : chararray, score : int);
        B = group A by name;
        describe B;
        dump B;
        C = FOREACH B GENERATE group, BestBook(A.reviewer, A.score) as reviewandscore;
        describe C;
        dump C;
  • BestBook: the output
        B: {group: chararray,A: {name: chararray,reviewer: chararray,score: int}}
        (book1,{(book1,aaa,1),(book1,bbb,3),(book1,ccc,12)})
        (book2,{(book2,aaa,4),(book2,bbb,1)})
        (book3,{(book3,ccc,1),(book3,bbb,5)})
        C: {group: chararray,reviewandscore: (chararray,int)}
        (book1,(ccc,12))
        (book2,(aaa,4))
        (book3,(bbb,5))
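The selection logic inside BestBook can be followed without Pig. This is a minimal sketch, assuming two parallel java.util.List values stand in for the reviewer and score bags; BestBookSketch and best are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for BestBook.exec: two parallel lists replace the two bags.
final class BestBookSketch {
    private BestBookSketch() {}

    // Returns {reviewer, score} for the highest score, walking both lists in step.
    static Object[] best(List<String> reviewers, List<Integer> scores) {
        int bestScore = -1;
        String bestReviewer = null;
        int n = Math.min(reviewers.size(), scores.size());
        for (int i = 0; i < n; i++) {
            if (scores.get(i) > bestScore) {
                bestScore = scores.get(i);
                bestReviewer = reviewers.get(i);
            }
        }
        return new Object[] { bestReviewer, bestScore };
    }

    public static void main(String[] args) {
        // The book1 rows from the bookscores file.
        Object[] result = best(Arrays.asList("aaa", "bbb", "ccc"), Arrays.asList(1, 3, 12));
        System.out.println(Arrays.toString(result));
    }
}
```

Like the UDF, the sketch relies on both columns arriving in the same order, and it reports a score of -1 with a null reviewer when there is nothing to compare.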
  • BestBook: improve by implementing Algebraic
    • If an EvalFunc can be run in stages and summed up, consider implementing Algebraic
    • Three methods to override:
      • String getInitial();
      • String getIntermed();
      • String getFinal();
    • See COUNT and DoubleAvg
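To see why a staged function helps, consider a maximum score computed in pieces. The sketch below only mimics the three stages in plain Java, assuming each stage is an ordinary static method; in Pig the three Algebraic methods instead return the class names of helper functions, and AlgebraicSketch with its method names is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a staged ("algebraic") maximum; not the Pig interface itself.
final class AlgebraicSketch {
    private AlgebraicSketch() {}

    // Initial stage: turn one raw score into a partial result.
    static int initial(int score) {
        return score;
    }

    // Intermediate stage: combine several partial results into one partial result.
    static int intermediate(List<Integer> partials) {
        int best = Integer.MIN_VALUE;
        for (int p : partials) {
            best = Math.max(best, p);
        }
        return best;
    }

    // Final stage: combine the remaining partial results into the answer.
    static int finalStage(List<Integer> partials) {
        return intermediate(partials);
    }

    public static void main(String[] args) {
        // The book1 scores 1, 3, 12 split across two chunks, as if handled separately.
        int chunk1 = intermediate(Arrays.asList(initial(1), initial(3)));
        int chunk2 = intermediate(Arrays.asList(initial(12)));
        System.out.println(finalStage(Arrays.asList(chunk1, chunk2)));
    }
}
```

Because the maximum of partial maxima equals the overall maximum, the chunks can be reduced independently before the final step, which is the property that makes a function a candidate for Algebraic.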
  • FilterFunc: a filter that's an EvalFunc
    • For keeping and discarding entries, write a filter
    • FilterFunc extends EvalFunc<Boolean>
    • Offers a method "void finish()" for cleanup
    • Example: keep only rows whose two dates are within 10 minutes of one another
  • FilterFunc: DateWithinFilter
        public class DateWithinFilter extends FilterFunc {
            @Override
            public Boolean exec(Tuple input) throws IOException {
                if (input.size() != 3) {
                    throw new IOException("error msg");
                }
                Date[] startAndTryDates = getColumnDates(input);
                if (startAndTryDates == null)
                    return false;
                long dateDiff = startAndTryDates[1].getTime() - startAndTryDates[0].getTime();
                if (dateDiff < 0) {
                    return false; // maybe make optional
                }
                // Third argument: the maximum allowed difference in milliseconds.
                int maxDateDiff = (Integer) input.get(2);
                return dateDiff <= maxDateDiff;
            }
  • FilterFunc: DateWithinFilter (cont)
        // df is a date format field of the class; its definition is not shown on the slide.
        private Date[] getColumnDates(Tuple input) throws ExecException {
            String strDate1 = (String) input.get(0);
            String strDate2 = (String) input.get(1);
            if (strDate1 == null || strDate2 == null) {
                return null;
            }
            Date date1 = null;
            try {
                date1 = df.parse(strDate1);
            } catch (ParseException e) {
                warn("date format err", PigWarning.UDF_WARNING_1);
                return null;
            }
            Date date2 = null;
            try {
                date2 = df.parse(strDate2);
            } catch (ParseException e) {
                warn("date format err", PigWarning.UDF_WARNING_1);
                return null;
            }
            return new Date[] { date1, date2 };
        }
  • FilterFunc: DateWithinFilter: defining what inputs we accept (stay tuned for what happens when this is violated)
        @Override
        public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
            List<FuncSpec> funcList = new ArrayList<FuncSpec>();
            Schema s = new Schema();
            s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
            s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
            s.add(new Schema.FieldSchema(null, DataType.INTEGER));
            funcList.add(new FuncSpec(this.getClass().getName(), s));
            return funcList;
        }
  • FilterFunc: DateWithinFilter in use
        $ cat src/test/resources/purchasetimes
        1234 2010-06-01 10:31:22 2010-06-01 10:32:22
        7121 2010-06-01 10:30:18 2010-06-01 11:02:59
        1234 2010-06-01 10:40:18 2010-06-01 10:45:32
        7681 lol wut
        4532 2010-06-01 11:37:18 2010-06-01 11:42:59
        $ cat src/test/resources/purchasetimes.pig
        REGISTER target/demo-pig-udf-1.0-SNAPSHOT.jar;
        purchasetimes = LOAD '$dir/purchasetimes' AS (userid: int, datein: chararray, dateout: chararray);
        quickybuyers = FILTER purchasetimes BY DateWithinFilter(datein, dateout, 600000);
        DUMP quickybuyers;
        $ pig -x local -f src/test/resources/purchasetimes.pig -param dir=src/test/resources/
        (1234,2010-06-01 10:31:22,2010-06-01 10:32:22)
        (1234,2010-06-01 10:40:18,2010-06-01 10:45:32)
        (4532,2010-06-01 11:37:18,2010-06-01 11:42:59)
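The filter decision itself is easy to reproduce outside Pig. The following is a minimal sketch, assuming the date pattern "yyyy-MM-dd HH:mm:ss" (inferred from the sample data, since the slides do not show how df is built) and a fixed UTC time zone; DateWithinSketch and within are hypothetical names.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Hypothetical standalone version of the DateWithinFilter decision.
final class DateWithinSketch {
    // Assumed pattern, inferred from the sample rows; not confirmed by the slides.
    private static final String PATTERN = "yyyy-MM-dd HH:mm:ss";

    private DateWithinSketch() {}

    // True when end is no earlier than start and at most maxMillis after it.
    static boolean within(String start, String end, long maxMillis) {
        if (start == null || end == null) {
            return false;
        }
        SimpleDateFormat df = new SimpleDateFormat(PATTERN);
        df.setLenient(false);
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            long diff = df.parse(end).getTime() - df.parse(start).getTime();
            return diff >= 0 && diff <= maxMillis;
        } catch (ParseException e) {
            // Unparseable dates are dropped, as in the filter above.
            return false;
        }
    }

    public static void main(String[] args) {
        long tenMinutes = 600000L;
        System.out.println(within("2010-06-01 10:31:22", "2010-06-01 10:32:22", tenMinutes));
        System.out.println(within("2010-06-01 10:30:18", "2010-06-01 11:02:59", tenMinutes));
        System.out.println(within("lol", "wut", tenMinutes));
    }
}
```

A new SimpleDateFormat is created per call because the class is not thread-safe; a UDF that runs single-threaded per task could keep one instance in a field instead.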
  • EvalFunc: not passing in the correct number of arguments
        $ cat src/test/resources/purchasetimes.pig
        quickybuyers = FILTER purchasetimes BY DateWithinFilter(datein, dateout);
        $ pig -x local -f src/test/resources/purchasetimes.pig -param dir=src/test/resources/
        2010-06-17 17:25:43,440 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: Could not infer the matching function for org.seattlehadoop.demo.pig.udf.DateWithinFilter as multiple or none of them fit. Please use an explicit cast.
        Details at logfile: /Users/cwilkes/Documents/workspace5/SeattleHadoop-demo-code/pig_1276820742917.log
    • The log file has:
        at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1197)
    • So the error is caught before any data is loaded
  • Agenda Point 4: LoadFunc
  • LoadFunc: definition
    • How does something get loaded into Pig?
    • A = load 'B';
    • But what is actually going on?
    • A = load 'B' using PigStorage();
    • PigStorage is a LoadFunc that reads off of disk and splits on tab to create a Tuple
  • LoadFunc: definition (cont)
    • LoadFunc is an interface with a number of methods, the most interesting being:
      • bindTo(fileName, inputStream, offset, end)
      • Tuple getNext()
    • Extend Utf8StorageConverter, as PigStorage does, to get defaults
    • Overview: PigStorage's getNext() creates an array of objects after splitting on a tab and puts those into a Tuple
  • LoadFunc: make our own
    • Have a lot of log files, some of which just contain a URL
    • http://example.com?use=mind+bullets&target=yak
    • Want to load the URLs and do analysis
    • Write your own LoadFunc that takes a URL and returns a Map of the query parameters
    • Know what parameters you care about, and only look for those
    • Goal:
    • A = LOAD 'urls' USING QuerystringLoader('query', 'userid') AS (query: chararray, userid : int);
  • LoadFunc: QuerystringLoader
    • Passing in constructor arguments from the pig script is easy:
    • public QuerystringLoader(String... fieldNames)
    • bindTo is almost exactly the same as the PigStorage one, using the PigLineRecordReader to parse the InputStream
    • Tuple getNext() is where the action happens:
      • parse the querystring into a Map
      • loop through the fields given in the constructor
      • return a Tuple of a list of those objects
  • LoadFunc: QuerystringLoader getNext()
        @Override
        public Tuple getNext() throws IOException {
            if (in == null || in.getPosition() > end) {
                return null;
            }
            Text value = new Text();
            boolean notDone = in.next(value);
            if (!notDone) {
                return null;
            }
            Map<String, Object> parameters = getParameterMap(value.toString());
            List<String> output = new ArrayList<String>();
            for (String fieldName : m_fieldsInOrder) {
                Object object = parameters.get(fieldName);
                if (object == null) {
                    output.add(null);
                    continue;
                }
                if (object instanceof String) {
                    output.add((String) object);
                } else {
                    // A repeated parameter: keep only the first value.
                    List<String> objectVal = (List<String>) object;
                    output.add(objectVal.get(0));
                }
            }
            return mTupleFactory.newTupleNoCopy(output);
        }
  • LoadFunc: notes
    • boolean okay = in.next(value) is how to get the next parsed line
    • getParameterMap(url) splits the querystring into a Map<String,Object>
    • Pig handles type conversion for you, just hand back a Tuple
    • In this case the Tuple can be made up of anything, so the user specifies the schema in the script
    • AS (query:chararray, userid:int)
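The slides do not show getParameterMap itself. One way it could work is sketched below, assuming a parameter seen once maps to a String and a repeated parameter maps to a List<String>, which is the shape getNext() expects; QuerystringSketch is a hypothetical name and this is not the author's implementation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a query-string parser; the real getParameterMap is not shown in the slides.
final class QuerystringSketch {
    private QuerystringSketch() {}

    // Maps each parameter name to a String, or to a List<String> when the name repeats.
    static Map<String, Object> getParameterMap(String url) {
        Map<String, Object> params = new LinkedHashMap<String, Object>();
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return params; // no query string on this line
        }
        for (String pair : url.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip pairs without a name or without '='
            }
            String name = decode(pair.substring(0, eq));
            String value = decode(pair.substring(eq + 1));
            Object existing = params.get(name);
            if (existing == null) {
                params.put(name, value);
            } else if (existing instanceof String) {
                // Second occurrence: promote the single value to a list.
                List<String> values = new ArrayList<String>();
                values.add((String) existing);
                values.add(value);
                params.put(name, values);
            } else {
                @SuppressWarnings("unchecked")
                List<String> values = (List<String>) existing;
                values.add(value);
            }
        }
        return params;
    }

    // Decodes '+' and percent-escapes; malformed escapes raise IllegalArgumentException.
    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

For the example URL on the earlier slide, the "use" parameter decodes to "mind bullets" and "target" to "yak"; a line without a "?" yields an empty map, so every requested field comes back null.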
  • RegexLoader
    • Same concept: pass in a Pattern for the constructor and have getNext() return only the matched parts
        @Override
        public Tuple getNext() throws IOException {
            // value holds the current line; reading it is elided on the slide.
            Matcher m = m_linePattern.matcher(value.toString());
            if (!m.matches()) {
                return EmptyTuple.getInstance();
            }
            List<String> regexMatches = new ArrayList<String>();
            for (int i = 1; i <= m.groupCount(); i++) {
                regexMatches.add(m.group(i));
            }
            return mTupleFactory.newTupleNoCopy(regexMatches);
        }
  • Agenda Point 5: Piggybank
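The group-extraction loop in RegexLoader can be exercised with java.util.regex alone. Here is a minimal sketch, assuming an empty list stands in for the empty tuple returned on a non-matching line; RegexSketch and groups are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone version of the group-extraction step in RegexLoader.
final class RegexSketch {
    private RegexSketch() {}

    // Returns the capture groups when the whole line matches, otherwise an empty list.
    static List<String> groups(Pattern linePattern, String line) {
        List<String> matches = new ArrayList<String>();
        Matcher m = linePattern.matcher(line);
        if (!m.matches()) {
            return matches;
        }
        // Group 0 is the whole match, so the captured parts start at index 1.
        for (int i = 1; i <= m.groupCount(); i++) {
            matches.add(m.group(i));
        }
        return matches;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(\\d+)\\s+(\\w+)");
        System.out.println(groups(p, "1234 hello"));
        System.out.println(groups(p, "lol"));
    }
}
```

Note that matches() requires the pattern to cover the entire line; a loader that should pick a pattern out of a longer line would need find() or a pattern padded with ".*".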
  • Piggybank
    • SVN repository of common UDFs
    • Excited about it at first, but it doesn't appear to be used that much
    • There needs to be an easier way of doing this
    • A CPAN (Perl) for Pig would be great
    • register pigpan://Math::FFT
    • Brings down the jars from a maven-like repository and tells Pig where to load them from
    • Any takers? Looking into it
  • Bonus section: unit testing
        @Test
        public void testRepeatQueryParams() throws IOException {
            // Three input lines; the middle one has no query string.
            String url = "http://localhost/foo?a=123&a=456\nx=y\nhttp://localhost/bar?a=761&b=hi";
            QuerystringLoader loader = new QuerystringLoader("a", "b");
            InputStream in = new ByteArrayInputStream(url.getBytes());
            loader.bindTo(null, new BufferedPositionedInputStream(in), 0, url.length());
            Tuple tuple = loader.getNext();
            assertEquals("123", (String) tuple.get(0));
            assertNull(tuple.get(1));
            tuple = loader.getNext();
            assertEquals(2, tuple.size());
            assertNull(tuple.get(0));
            assertNull(tuple.get(1));
            tuple = loader.getNext();
            assertEquals("761", (String) tuple.get(0));
            assertEquals("hi", (String) tuple.get(1));
        }
  • Resources
    • UDF reference: http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html
    • Code samples: http://github.com/seattlehadoop
    • Presentation: http://www.slideshare.net/seattlehadoop