Introduction to
   Pig UDFs

       Chris Wilkes
cwilkes@seattlehadoop.org
Agenda


  1 What, Why, How

  2 EvalFunc basics

  3 More EvalFunc
  4 LoadFunc

  5 Piggybank
Agenda Point 1


  1 What, Why, How

  2 EvalFunc basics

  3 More EvalFunc
  4 LoadFunc

  5 Piggybank
What is a UDF?



  User Defined Function

  • Way to do an operation on a field or fields
  • Note: not on the group
  • Called from within a pig script
  • b = FOREACH a GENERATE foo(color)
  • Currently all done in java
Why use a UDF?



  • You need to do more than grouping or
   filtering
  • Actually filtering is a UDF
  • Probably using them already
  • Maybe more comfortable in java land
   than in SQL / Pig Latin
How to write and use?




 • Just extend / implement an
   interface
 • No need for administrator
   rights, just call your script
 • Very simple java, just think
   about your small problem

                                   Magical Powers not required
Moving right along




      Now to the informative part of the talk
Agenda


  1 What, Why, How

  2 EvalFunc basics

  3 More EvalFunc
  4 LoadFunc

  5 Piggybank
EvalFunc : probably what you need to do



  • Easiest to understand: takes one or more
   fields and spits back a generic object
  • Extend the EvalFunc class and it
   practically writes itself
  • Let’s look at the UPPER example from the
   piggybank
The UPPER EvalFunc

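 import java.io.IOException;
 import org.apache.pig.EvalFunc;
 import org.apache.pig.PigWarning;
 import org.apache.pig.data.Tuple;

 public class UPPER extends EvalFunc<String> {

     @Override
     public String exec(Tuple input) throws IOException {
        // Nothing to upper-case: return null so the record is skipped
        if (input == null || input.size() == 0 || input.get(0) == null)
           return null;
        try {
           return ((String) input.get(0)).toUpperCase();
        } catch (ClassCastException e) {
           warn("error msg", PigWarning.UDF_WARNING_1);
        } catch (Exception e) {
           warn("Error", PigWarning.UDF_WARNING_1);
        }
        return null;
     }

 }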



         modified version from the piggybank SVN
The UPPER EvalFunc


     The generic <String> tells Pig what class will be
              returned from this method
The UPPER EvalFunc



 The Tuple input contains the fields passed inside the () of the call in the script
The UPPER EvalFunc





          Check your inputs for empties or nulls
The UPPER EvalFunc




 You have to know that the 1st parameter inside the
                 tuple is a String
The UPPER EvalFunc



  Catch errors that are acceptable and return null so
           the record can be skipped over
The UPPER EvalFunc




 public class UPPER extends EvalFunc<String> {
    public List<FuncSpec> getArgToFuncMapping() {
       List<FuncSpec> funcList = new ArrayList<FuncSpec>();
       funcList.add(new FuncSpec(this.getClass().getName(),
          new Schema(new Schema.FieldSchema(null,
             DataType.CHARARRAY))));
       return funcList;
    }
 }




   Tells Pig what parameters this function takes
Recap of UPPER


 • Generics outline the contract for the return type
 • Schemas are preserved (chararray / String)
 • Check inputs for empty or null
 • Return null if item should be skipped
     • Throw an exception if deadly
 • Name “UPPER” can be used if known to
   PigContext’s packageImportList, otherwise need
   full classname
 • Cast items inside of the Tuple parameter
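
The recap can be tried out without a Pig install. The sketch below is plain Java, not Pig code: a List<Object> stands in for Pig's Tuple, and the class name UpperSketch is made up for illustration. It only shows the contract (check inputs, cast, return null to skip).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java stand-in for the UPPER contract; a List<Object> plays the role of Pig's Tuple.
final class UpperSketch {

    // Returns the upper-cased first field, or null when the record should be skipped.
    static String upper(List<Object> input) {
        // Empty or null input: nothing to do, so skip the record.
        if (input == null || input.isEmpty() || input.get(0) == null) {
            return null;
        }
        try {
            // The caller has to know that field 0 is a String; the cast checks it at runtime.
            return ((String) input.get(0)).toUpperCase();
        } catch (ClassCastException e) {
            // An acceptable error: a real UDF would call warn(...) before returning null.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(upper(Arrays.<Object>asList("red")));     // RED
        System.out.println(upper(Arrays.<Object>asList(42)));        // null (not a String)
        System.out.println(upper(Collections.<Object>emptyList()));  // null (empty input)
    }
}
```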
Another simple EvalFunc: AstroDist




 • Two input files: planet names with coordinates
   and pairs of planets
 • Goal: find the distance between the pairs
 • Loading is slightly different: coords in a tuple
 • Input to EvalFunc is a Tuple that contains a Tuple
AstroDist input files

$ cat src/test/resources/cosmo
aaa bbb
aaa ccc
ddd aaa

$ cat src/test/resources/planets
aaa (1,0,10)
bbb (2,-5,15)
ccc (-7,12,48)
ddd (3,3,8)

[image from xkcd.com]
AstroDist pig script

 REGISTER target/pig-demo-1.0-SNAPSHOT.jar;

 planets = load '$dir/planets' as (name : chararray,
               l:tuple(x : int, y : int, z : int));
 cosmo = load '$dir/cosmo' as (planet1 : chararray, planet2 : chararray);

 A = JOIN cosmo BY planet1, planets BY name;
 B = JOIN A by planet2, planets BY name;

 locations = FOREACH B GENERATE
   $1 AS p1name:chararray,
   $2 AS p2name : chararray,
   AstroDist($3,$5) as distance;

 dump locations;
AstroDist output


   $ pig -x local -f src/test/resources/distances.pig
    -param dir=src/test/resources/

   What B looks like:
   (ddd,aaa,ddd,(3,3,8),aaa,(1,0,10))
   (aaa,bbb,aaa,(1,0,10),bbb,(2,-5,15))
   (aaa,ccc,aaa,(1,0,10),ccc,(-7,12,48))

   Output:
   (aaa,ddd,4.123105625617661)
   (bbb,aaa,7.14142842854285)
   (ccc,aaa,40.64480286580315)
AstroDist program


 public class AstroDist extends EvalFunc<Double> {
   @Override
    public Double exec(Tuple input) throws IOException {
      Point3D astroPos1 = new Point3D((Tuple) input.get(0));
      Point3D astroPos2 = new Point3D((Tuple) input.get(1));
      return astroPos1.distance(astroPos2);
    }
    @Override
    public List<FuncSpec> getArgToFuncMapping() {
      Schema s = new Schema();
      s.add(new Schema.FieldSchema(null, DataType.TUPLE));
      s.add(new Schema.FieldSchema(null, DataType.TUPLE));
      return Arrays.asList(
        new FuncSpec(this.getClass().getName(), s));
    }
 }
AstroDist program (cont)


 private static class Point3D {
   private final int x, y, z;

   private Point3D(Tuple tuple) throws ExecException {
     if (tuple.size() != 3) {
       throw new ExecException("Received " + tuple.size() +
         " points in 3D tuple", ERROR_CODE_BAD_TUPLE, PigException.BUG);
     }
     x = (Integer) tuple.get(0);
     y = (Integer) tuple.get(1);
     z = (Integer) tuple.get(2);
   }

   private double distance(Point3D other) {
     return Math.sqrt(Math.pow(x - other.x, 2) +
        Math.pow(y - other.y, 2) + Math.pow(z - other.z, 2));
   }
 }
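
The distance arithmetic can be checked by hand against the sample planets. This is a plain-Java sketch with no Pig classes: int arrays stand in for the coordinate tuples, and the class name DistanceSketch is made up for illustration.

```java
// Plain-Java check of the Euclidean distance used by AstroDist;
// int arrays stand in for the Pig coordinate tuples.
final class DistanceSketch {

    static double distance(int[] a, int[] b) {
        if (a.length != 3 || b.length != 3) {
            throw new IllegalArgumentException("expected 3 coordinates");
        }
        double sum = 0;
        for (int i = 0; i < 3; i++) {
            double d = a[i] - b[i];
            sum += d * d;   // squared difference per axis
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        int[] aaa = {1, 0, 10};
        int[] ddd = {3, 3, 8};
        // Differences are (2, 3, -2), so the result is sqrt(4 + 9 + 4) = sqrt(17), about 4.123.
        System.out.println(distance(ddd, aaa));
    }
}
```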
Fun times when running this script

  • Looking through PigContext and Main found
    that /pig.properties in the classpath is parsed for
    the key/value “udf.import.list”
  • Put this into my jar (src/main/resources with
    maven) but it didn’t appear to load
  • Debug log should show what’s going on, except
    debug isn’t turned on till after this load
  • Ended up putting into ~/.pigrc but Pig warns that
    it should go into conf/pig.properties, a file that
    isn’t read
  • Schemas and UDFs are picky, use trial and error
Agenda Point 3


  1 What, Why, How

  2 EvalFunc basics

  3 More EvalFunc
  4 LoadFunc

  5 Piggybank
Returning a Tuple from a UDF


 • Sometimes you want to return more than one
   thing from a function
 • For example an expensive calculation was done
   and its results can be reused
 • But what should be returned?
   • Of course a Tuple
   • “tuple” is the answer 92% of the time


                                                   http://tuplemusic.org/
                                     Tuple is dedicated to exploring and expanding
                                     the contemporary repertoire for two bassoons
BestBook: returns the highest scored book

   $ cat src/test/resources/bookscores
   book1 aaa 1
   book1 bbb 3
   book1 ccc 12
   book2 aaa 4
   book2 bbb 1
   book3 ccc 1
   book3 bbb 5

   Want output saying that for book3, reviewer bbb was
   the highest at 5
BestBook EvalFunc

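 import java.io.IOException;
 import java.util.Arrays;
 import java.util.Iterator;
 import org.apache.pig.EvalFunc;
 import org.apache.pig.data.DataBag;
 import org.apache.pig.data.Tuple;
 import org.apache.pig.data.TupleFactory;

 public class BestBook extends EvalFunc<Tuple> {

    @Override
    public Tuple exec(Tuple p_input) throws IOException {
       // Each argument is a bag holding one "column" of the group
       Iterator<Tuple> bagReviewers =
          ((DataBag) p_input.get(0)).iterator();
       Iterator<Tuple> bagScores =
          ((DataBag) p_input.get(1)).iterator();
       int bestScore = -1;
       String bestReviewer = null;
       // Walk both bags in step and remember the highest score seen
       while (bagReviewers.hasNext() && bagScores.hasNext()) {
          String reviewerName = (String) bagReviewers.next().get(0);
          Integer score = (Integer) bagScores.next().get(0);
          if (score.intValue() > bestScore) {
             bestScore = score;
             bestReviewer = reviewerName;
          }
       }
       return TupleFactory.getInstance().newTuple(
          Arrays.asList(bestReviewer, (Integer) bestScore));
    }
 }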
BestBook EvalFunc

                      The inputs are bag “columns”
BestBook EvalFunc

                 return a Tuple that’s just like the inputs
BestBook EvalFunc




 public class BestBook extends EvalFunc<Tuple> {

    @Override
    public Schema outputSchema(Schema p_input) {
       try {
          return Schema.generateNestedSchema(DataType.TUPLE,
             DataType.CHARARRAY, DataType.INTEGER);
       } catch (FrontendException e) {
          throw new IllegalStateException(e);
       }
    }
 }



                     How to define the outbound
                       schema inside the Tuple
BestBook: returns the highest scored book


   REGISTER target/demo-pig-udf-1.0-SNAPSHOT.jar;

   A = LOAD '$dir/bookscores' as (name : chararray,
     reviewer : chararray, score : int);

   B = group A by name;
   describe B;
   dump B;

   C = FOREACH B GENERATE group,
       BestBook(A.reviewer, A.score) as reviewandscore;

   describe C;
   dump C;
BestBook: returns the highest scored book



B: {group: chararray,A: {name: chararray,reviewer: chararray,score: int}}
(book1,{(book1,aaa,1),(book1,bbb,3),(book1,ccc,12)})
(book2,{(book2,aaa,4),(book2,bbb,1)})
(book3,{(book3,ccc,1),(book3,bbb,5)})

C: {group: chararray,reviewandscore: (chararray,int)}
(book1,(ccc,12))
(book2,(aaa,4))
(book3,(bbb,5))
BestBook: improve by implementing Algebraic



 •If EvalFunc can be run in stages and summed up consider
 implementing Algebraic
 •Three methods to override:
   •String getInitial();
   •String getIntermed();
   •String getFinal()
 •See COUNT and DoubleAvg
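
Why "best score" fits the Algebraic idea: the best of several partial bests is the overall best, so the work can be done in stages. The sketch below is plain Java with no Pig classes. The names AlgebraicSketch, Best and combine are made up for illustration; a real implementation would return the class names of its Initial, Intermed and Final functions from the three methods above.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of why "best score" can be computed in stages.
final class AlgebraicSketch {

    // A (reviewer, score) pair, standing in for the Tuple that BestBook returns.
    static final class Best {
        final String reviewer;
        final int score;

        Best(String reviewer, int score) {
            this.reviewer = reviewer;
            this.score = score;
        }
    }

    // Reduces any number of partial results to one. The same step can serve
    // both the intermediate and the final stage because max is associative.
    static Best combine(List<Best> partials) {
        Best best = null;
        for (Best candidate : partials) {
            if (best == null || candidate.score > best.score) {
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The book1 reviews, split across two hypothetical map tasks.
        Best fromTask1 = combine(Arrays.asList(new Best("aaa", 1), new Best("bbb", 3)));
        Best fromTask2 = combine(Arrays.asList(new Best("ccc", 12)));
        Best overall = combine(Arrays.asList(fromTask1, fromTask2));
        System.out.println(overall.reviewer + " " + overall.score); // ccc 12
    }
}
```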
FilterFunc: a filter that’s an EvalFunc




  • For keeping and discarding entries write a filter
  • FilterFunc extends EvalFunc<Boolean>
    • Adds a method “void finish()” for cleanup
  • Example: only want dates that are within 10
   minutes of one another
FilterFunc: DateWithinFilter


 public class DateWithinFilter extends FilterFunc {

    @Override
    public Boolean exec(Tuple input) throws IOException {
       if (input.size() != 3) {
          throw new IOException("error msg");
       }
       Date[] startAndTryDates = getColumnDates(input);
       if (startAndTryDates == null)
          return false;
       // Difference in milliseconds between the two dates
       long dateDiff = startAndTryDates[1].getTime() -
          startAndTryDates[0].getTime();
       if (dateDiff < 0) {
          return false; // maybe make optional
       }
       int maxDateDiff = (Integer) input.get(2);
       return dateDiff <= maxDateDiff;
    }
FilterFunc: DateWithinFilter
    private Date[] getColumnDates(Tuple input) throws ExecException {
       String strDate1 = (String) input.get(0);
       String strDate2 = (String) input.get(1);
       if (strDate1 == null || strDate2 == null) {
          return null;
       }
       Date date1 = null;
       try {
          date1 = df.parse(strDate1);
       } catch (ParseException e) {
          warn("date format err", PigWarning.UDF_WARNING_1);
          return null;
       }
       Date date2 = null;
       try {
          date2 = df.parse(strDate2);
       } catch (ParseException e) {
          warn("date format err", PigWarning.UDF_WARNING_1);
          return null;
       }
       return new Date[] { date1, date2 };
    }
FilterFunc: DateWithinFilter

    @Override
    public List<FuncSpec> getArgToFuncMapping() throws
          FrontendException {
       List<FuncSpec> funcList = new ArrayList<FuncSpec>();
       Schema s = new Schema();
       s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
       s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
       s.add(new Schema.FieldSchema(null, DataType.INTEGER));
       funcList.add(new FuncSpec(this.getClass().getName(), s));
       return funcList;
    }




            Defining what inputs we accept
        stay tuned for what happens when violated
FilterFunc: DateWithinFilter
$ cat src/test/resources/purchasetimes
1234 2010-06-01 10:31:22 2010-06-01 10:32:22
7121 2010-06-01 10:30:18 2010-06-01 11:02:59
1234 2010-06-01 10:40:18 2010-06-01 10:45:32
7681 lol wut
4532 2010-06-01 11:37:18 2010-06-01 11:42:59

$ cat src/test/resources/purchasetimes.pig
REGISTER target/demo-pig-udf-1.0-SNAPSHOT.jar;
purchasetimes = LOAD '$dir/purchasetimes' AS
  (userid: int, datein: chararray, dateout: chararray);
quickybuyers = FILTER purchasetimes BY
  DateWithinFilter(datein, dateout, 600000);
DUMP quickybuyers;

$ pig -x local -f src/test/resources/purchasetimes.pig
    -param dir=src/test/resources/
(1234,2010-06-01 10:31:22,2010-06-01 10:32:22)
(1234,2010-06-01 10:40:18,2010-06-01 10:45:32)
(4532,2010-06-01 11:37:18,2010-06-01 11:42:59)
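
The filter decision can be reproduced outside Pig. The sketch below is plain Java and rests on assumptions: the slides never show how df is built, so the pattern "yyyy-MM-dd HH:mm:ss" is inferred from the sample data, UTC is picked only to keep the arithmetic independent of the local time zone, and the class name DateWithinSketch is made up.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Plain-Java version of the DateWithinFilter decision; no Pig classes involved.
final class DateWithinSketch {

    // Assumed pattern, inferred from the purchasetimes sample rows.
    private static final String PATTERN = "yyyy-MM-dd HH:mm:ss";

    // True when end is at most maxMillis after start; false on bad or reversed dates.
    static boolean within(String start, String end, long maxMillis) {
        if (start == null || end == null) {
            return false;
        }
        SimpleDateFormat df = new SimpleDateFormat(PATTERN);
        df.setLenient(false);
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            long diff = df.parse(end).getTime() - df.parse(start).getTime();
            return diff >= 0 && diff <= maxMillis;
        } catch (ParseException e) {
            // Unparseable dates such as "lol" / "wut" are filtered out.
            return false;
        }
    }

    public static void main(String[] args) {
        long tenMinutes = 600000L;
        System.out.println(within("2010-06-01 10:31:22", "2010-06-01 10:32:22", tenMinutes)); // true
        System.out.println(within("2010-06-01 10:30:18", "2010-06-01 11:02:59", tenMinutes)); // false
        System.out.println(within("lol", "wut", tenMinutes));                                 // false
    }
}
```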
EvalFunc: not passing in the correct number of args
$ cat src/test/resources/purchasetimes.pig
quickybuyers = FILTER purchasetimes BY
  DateWithinFilter(datein, dateout);

$ pig -x local -f src/test/resources/purchasetimes.pig -param dir=src/test/resources/

2010-06-17 17:25:43,440 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR
1045: Could not infer the matching function for
org.seattlehadoop.demo.pig.udf.DateWithinFilter as multiple or none of them fit.
Please use an explicit cast.
Details at logfile: /Users/cwilkes/Documents/workspace5/SeattleHadoop-demo-code/pig_1276820742917.log

The log file has:
       at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1197)
so the error is caught before loading data
Agenda


  1 What, Why, How

  2 EvalFunc basics

  3 More EvalFunc
  4 LoadFunc

  5 Piggybank
LoadFunc: definition




  • How does something get loaded into Pig?
  • A = load 'B';
  • But what is actually going on?
  • A = load 'B' using PigStorage();
  • PigStorage is a LoadFunc that reads off of disk
   and splits on tab to create a Tuple
LoadFunc: definition


  •  LoadFunc is an interface with a number of
    methods, the most interesting being
    • bindTo(fileName,inputStream,offset,end)
    • Tuple getNext()
  • Extend from Utf8StorageConverter like
    PigStorage to get defaults
  • Overview: PigStorage’s getNext() creates an array
    of objects after splitting on a tab and puts those
    into a Tuple
LoadFunc: make our own



 • Have a lot of log files, some just contain a URL
   • http://example.com?use=mind+bullets&target=yak
 • Want to load URLs and do analysis
 • Write your own LoadFunc to do this that takes a
   URL and returns a Map of the query parameters
 • Know what parameters you care about, only look
   for those
LoadFunc: make our own

 • Goal:
 • A = LOAD 'urls' USING
   QuerystringLoader('query', 'userid') AS (query:
   chararray, userid : int);
LoadFunc: QuerystringLoader

 • Passing in constructor arguments from the pig
   script is easy:
   • public QuerystringLoader(String... fieldNames)
 • bindTo is almost exactly the same as the
   PigStorage one, using the PigLineRecordReader
   to parse the InputStream
 • Tuple getNext() is where the action happens
   • parse the querystring into a Map
   • loop through the fields given in the constructor
   • return a Tuple of a list of those objects
LoadFunc: QuerystringLoader getNext()

 
    @Override
    public Tuple getNext() throws IOException {
       if (in == null || in.getPosition() > end) {
          return null;
       }
       Text value = new Text();
       boolean notDone = in.next(value);
       if (!notDone) {
          return null;
       }
       Map<String, Object> parameters = getParameterMap(value.toString());
       List<String> output = new ArrayList<String>();
       for (String fieldName : m_fieldsInOrder) {
          Object object = parameters.get(fieldName);
          if (object == null) {
             output.add(null);
             continue;
          }
          if (object instanceof String) {
             output.add((String) object);
          } else {
             // Repeated parameter: keep only the first value
             List<String> objectVal = (List<String>) object;
             output.add(objectVal.get(0));
          }
       }
       return mTupleFactory.newTupleNoCopy(output);
    }
LoadFunc: notes



 • boolean notDone = in.next(value) is how to get the next
   parsed line
 • getParameterMap(url) splits querystring into a
   Map<String,Object>
 • Pig handles type conversion for you, just hand back
   a Tuple.
 • In this case the Tuple can be made up of anything so
   user specifies the schema in the script
 • AS (query:chararray, userid:int)
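
The slides do not show getParameterMap itself. The sketch below is one way it might look, in plain Java with no Pig classes. Everything about it is an assumption except the shape of the result seen in getNext(): a Map<String,Object> whose values are either a String or, for repeated parameters, a List<String>. Treating a line without '?' as having no parameters is also an assumption.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical querystring parser; the real getParameterMap is not shown on the slides.
final class QuerystringSketch {

    // Values are a String for a single occurrence, a List<String> for repeats.
    @SuppressWarnings("unchecked")
    static Map<String, Object> getParameterMap(String url) {
        Map<String, Object> params = new LinkedHashMap<String, Object>();
        int q = url.indexOf('?');
        if (q < 0) {
            return params; // assumption: no '?' means no parameters
        }
        for (String pair : url.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = decode(eq < 0 ? pair : pair.substring(0, eq));
            String value = eq < 0 ? "" : decode(pair.substring(eq + 1));
            Object existing = params.get(key);
            if (existing == null) {
                params.put(key, value);
            } else if (existing instanceof String) {
                // Second occurrence: switch from a single String to a list.
                List<String> values = new ArrayList<String>();
                values.add((String) existing);
                values.add(value);
                params.put(key, values);
            } else {
                ((List<String>) existing).add(value);
            }
        }
        return params;
    }

    // Turns '+' and %XX escapes back into plain characters.
    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // Prints the parsed parameters of the example URL from the slides.
        System.out.println(getParameterMap("http://example.com?use=mind+bullets&target=yak"));
    }
}
```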
RegexLoader


 Same concept, pass in a Pattern for the constructor
 and have getNext() return only the matched parts
    @Override
    public Tuple getNext() throws IOException {
       // value: the line just read (reading code not shown on the slide)
       Matcher m = m_linePattern.matcher(value.toString());
       if (!m.matches()) {
          return EmptyTuple.getInstance();
       }
       List<String> regexMatches = new ArrayList<String>();
       for (int i = 1; i <= m.groupCount(); i++) {
          regexMatches.add(m.group(i));
       }
       return mTupleFactory.newTupleNoCopy(regexMatches);
    }
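
The group-extraction loop can be tried on its own. This is a plain-Java sketch with no Pig classes: a List<String> stands in for the returned Tuple, an empty list stands in for EmptyTuple, and the class name and the pattern in main are made up for illustration (the pattern is written against the purchasetimes sample lines).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java version of the capture-group loop from RegexLoader.
final class RegexGroupsSketch {

    // Returns the capture groups of a full-line match, or an empty list if the line does not match.
    static List<String> captureGroups(Pattern linePattern, String line) {
        List<String> matches = new ArrayList<String>();
        Matcher m = linePattern.matcher(line);
        if (!m.matches()) {
            return matches;
        }
        // Group 0 is the whole match, so the fields start at group 1.
        for (int i = 1; i <= m.groupCount(); i++) {
            matches.add(m.group(i));
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical pattern: a user id followed by two "date time" pairs.
        Pattern p = Pattern.compile("(\\d+)\\s+(\\S+ \\S+)\\s+(\\S+ \\S+)");
        System.out.println(captureGroups(p, "1234 2010-06-01 10:31:22 2010-06-01 10:32:22"));
        System.out.println(captureGroups(p, "7681 lol wut")); // no match, so an empty list
    }
}
```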
Agenda


  1 What, Why, How

  2 EvalFunc basics

  3 More EvalFunc
  4 LoadFunc

  5 Piggybank
Piggybank

  • SVN repository of common UDFs
  • Excited about it at first, doesn’t appear to be
    used that much
  • There needs to be an easier way of doing this
  • CPAN (Perl) for Pig would be great
      • register pigpan://Math::FFT
  • brings down the jars from a maven-like
    repository and tells pig where to load from
  • any takers? Looking into it
Bonus section: unit testing
 @Test
 public void testRepeatQueryParams() throws IOException {
    // Three input lines: repeated param, no querystring, two params
    String url = "http://localhost/foo?a=123&a=456\nx=y\nhttp://localhost/bar?a=761&b=hi";
    QuerystringLoader loader = new QuerystringLoader("a", "b");
    InputStream in = new ByteArrayInputStream(url.getBytes());
    loader.bindTo(null,
       new BufferedPositionedInputStream(in), 0, url.length());
    Tuple tuple = loader.getNext();
    assertEquals("123", (String) tuple.get(0));
    assertNull(tuple.get(1));
    tuple = loader.getNext();
    assertEquals(2, tuple.size());
    assertNull(tuple.get(0));
    assertNull(tuple.get(1));
    tuple = loader.getNext();
    assertEquals("761", (String) tuple.get(0));
    assertEquals("hi", (String) tuple.get(1));
 }
Resources


 UDF reference:
 http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html

 Code samples:
 http://github.com/seattlehadoop

 Presentation:
 http://www.slideshare.net/seattlehadoop

Intro to Pig UDF

  • 1.
    Introduction to Pig UDFs Chris Wilkes cwilkes@seattlehadoop.org
  • 2.
    Agenda 1What, Why, How 2 EvalFunc basics 3 More EvalFunc 4 LoadFunc 5 Piggybank
  • 3.
    Agenda Point 1 1 What, Why, How 2 EvalFunc basics 3 More EvalFunc 4 LoadFunc 5 Piggybank
  • 4.
    What is aUDF? User Defined Function • Way to do an operation on a field or fields • Note: not on the group • Called from within a pig script • b = FOREACH a GENERATE foo(color) • Currently all done in java
  • 5.
    Why use aUDF? • You need to do more than grouping or filtering • Actually filtering is a UDF • Probably using them already • Maybe more comfortable in java land than in SQL / Pig Latin
  • 6.
    How to writean use? • Just extend / implement an interface • No need for administrator rights, just call your script • Very simple java, just think about your small problem Magical Powers not required
  • 7.
    Moving right along Now to the informative part of the talk
  • 8.
    Agenda 1What, Why, How 2 EvalFunc basics 3 More EvalFunc 4 LoadFunc 5 Piggybank
  • 9.
    EvalFunc : probablywhat you need to do •Easiest to understand: takes one or more fields and spits back a generic object •Extend the EvalFunc interface and it practically writes itself •Let’s look at the UPPER example from the piggybank
  • 10.
    The UPPER EvalFunc public class UPPER extends EvalFunc<String> { @Override public String exec(Tuple input) throws IOException { if (input == null||input.size() == 0||input.get(0) == null) return null; try { return ((String)input.get(0)).toUpperCase(); } catch (ClassCastException e) { warn(“error msg”, PigWarning.UDF_WARNING_1); } catch(Exception e){ warn("Error”, PigWarning.UDF_WARNING_1); } return null; } } modified version from the piggybank SVN
  • 11.
    The UPPER EvalFunc public class UPPER extends EvalFunc<String> { @Override public String exec(Tuple input) throws IOException { if (input == null||input.size() == 0||input.get(0) == null) return null; try { return ((String)input.get(0)).toUpperCase(); } catch (ClassCastException e) { warn(“error msg”, PigWarning.UDF_WARNING_1); } catch(Exception e){ warn("Error”, PigWarning.UDF_WARNING_1); } return null; } } The generic <String> tells Pig what class will be returned from this method
  • 12.
    The UPPER EvalFunc public class UPPER extends EvalFunc<String> { @Override public String exec(Tuple input) throws IOException { if (input == null||input.size() == 0||input.get(0) == null) return null; try { return ((String)input.get(0)).toUpperCase(); } catch (ClassCastException e) { warn(“error msg”, PigWarning.UDF_WARNING_1); } catch(Exception e){ warn("Error”, PigWarning.UDF_WARNING_1); } return null; } } The Tuple input contains the fields within the script ()
  • 13.
    The UPPER EvalFunc public class UPPER extends EvalFunc<String> { public String exec(Tuple input) throws IOException { if (input == null||input.size() == 0||input.get(0) == null) return null; try { return ((String)input.get(0)).toUpperCase(); } catch (ClassCastException e) { warn(“error msg”, PigWarning.UDF_WARNING_1); } catch(Exception e){ warn("Error”, PigWarning.UDF_WARNING_1); } return null; } } Check your inputs for empties or nulls
  • 14.
    The UPPER EvalFunc public class UPPER extends EvalFunc<String> { public String exec(Tuple input) throws IOException { if (input == null||input.size() == 0||input.get(0) == null) return null; try { return ((String)input.get(0)).toUpperCase(); } catch (ClassCastException e) { warn(“error msg”, PigWarning.UDF_WARNING_1); } catch(Exception e){ warn("Error”, PigWarning.UDF_WARNING_1); } return null; } } You have to know that the 1st parameter inside the tuple is a String
  • 15.
    The UPPER EvalFunc public class UPPER extends EvalFunc<String> { public String exec(Tuple input) throws IOException { if (input == null||input.size() == 0||input.get(0) == null) return null; try { return ((String)input.get(0)).toUpperCase(); } catch (ClassCastException e) { warn(“error msg”, PigWarning.UDF_WARNING_1); } catch(Exception e){ warn("Error”, PigWarning.UDF_WARNING_1); } return null; } } Catch errors that are acceptable and return null so can be skipped over
  • 16.
    The UPPER EvalFunc public class UPPER extends EvalFunc<String> { public List<FuncSpec> getArgToFuncMapping() { List<FuncSpec> funcList = new ArrayList<FuncSpec>(); funcList.add(new FuncSpec(this.getClass().getName(), new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY)))); return funcList; } } Tells Pig what parameters this function takes
  • 17.
    Recap of UPPER • Generics outlines contract for return type • Schemas are preserved (chararray / String) • Check inputs for empty or null • Return null if item should be skipped • Throw an exception if deadly • Name “UPPER” can be used if known to PigContext’s packageImportList, otherwise need full classname • Cast items inside of the Tuple parameter
  • 18.
    Another simple EvalFunc:AstroDist • Two input files: planet names with coordinates and pairs of planets • Goal: find the distance between the pairs • Loading is slightly different: coords in a tuple • Input to EvalFunc is a Tuple that contains a Tuple
  • 19.
    AstroDist input files $cat src/test/resources/cosmo aaa bbb aaa ccc ddd aaa $ cat src/test/resources/planets aaa (1,0,10) bbb (2,-5,15) ccc (-7,12,48) image from xkcd.com ddd (3,3,8)
  • 20.
    AstroDist pig script REGISTER target/pig-demo-1.0-SNAPSHOT.jar; planets = load '$dir/planets' as (name : chararray, l:tuple(x : int, y : int, z : int)); cosmo = load '$dir/cosmo' as (planet1 : chararray, planet2 : chararray); A = JOIN cosmo BY planet1, planets BY name; B = JOIN A by planet2, planets BY name; locations = FOREACH B GENERATE $1 AS p1name:chararray, $2 AS p2name : chararray, AstroDist($3,$5) as distance; dump locations;
  • 21.
    AstroDist output $ pig -x local -f src/test/resources/distances.pig -param dir=src/test/resources/ What B looks like: (ddd,aaa,ddd,(3,3,8),aaa,(1,0,10)) (aaa,bbb,aaa,(1,0,10),bbb,(2,-5,15)) (aaa,ccc,aaa,(1,0,10),ccc,(-7,12,48)) Output: (aaa,ddd,4.123105625617661) (bbb,aaa,7.14142842854285) (ccc,aaa,40.64480286580315)
  • 22.
    AstroDist program publicclass AstroDist extends EvalFunc<Double> { @Override public Double exec(Tuple input) throws IOException { Point3D astroPos1 = new Point3D((Tuple) input.get(0)); Point3D astroPos2 = new Point3D((Tuple) input.get(1)); return astroPos1.distance(astroPos2); } @Override public List<FuncSpec> getArgToFuncMapping() { Schema s = new Schema(); s.add(new Schema.FieldSchema(null, DataType.TUPLE)); s.add(new Schema.FieldSchema(null, DataType.TUPLE)); return Arrays.asList( new FuncSpec(this.getClass().getName(), s)); } }
  • 23.
    AstroDist program (cont) private static class Point3D { private final int x, y, z; private Point3D(Tuple tuple) throws ExecException { if (tuple.size() != 3) { throw new ExecException("Received " + tuple.size() + " points in 3D tuple", ERROR_CODE_BAD_TUPLE, PigException.BUG); } x = (Integer) tuple.get(0); y = (Integer) tuple.get(1); z = (Integer) tuple.get(2); } private double distance(Point3D other) { return Math.sqrt(Math.pow(x - other.x, 2) + Math.pow(y - other.y, 2) + Math.pow(z - other.z, 2)); } }
  • 24.
    Fun times whenrunning this script • Looking through PigContext and Main found that /pig.properties in the classpath is parsed for the key/value “udf.import.list” • Put this into my jar (src/main/resources with maven) but it didn’t appear to load • Debug log should show what’s going on, except debug isn’t turned on till after this load • Ended up putting into ~/.pigrc but Pig warns that it should go into conf/pig.properties, a file that isn’t read • Schemas and UDFs are picky, use trial and error
  • 25.
    Agenda Point 3 1 What, Why, How 2 EvalFunc basics 3 More EvalFunc 4 LoadFunc 5 Piggybank
  • 26.
    Returning a Tuplefrom a UDF • Sometimes you want to return more than one thing from a function • For example an expensive calculation was done and its results can be reused • But what should be returned? • Of course a Tuple • “tuple” is the answer 92% of the time http://tuplemusic.org/ Tuple is dedicated to exploring and expanding the contemporary repertoire for two bassoons
  • 27.
    BestBook: returns thehighest scored book $ cat src/test/resources/bookscores book1 aaa 1 book1 bbb 3 Want output of that for book1 ccc 12 book3 reviewer bbb was book2 aaa 4 the highest at 5 book2 bbb 1 book3 ccc 1 book3 bbb 5
  • 28.
    BestBook EvalFunc publicclass BestBook extends EvalFunc<Tuple> { @Override public Tuple exec(Tuple p_input) throws IOException { Iterator<Tuple> bagReviewers = ((DataBag) p_input.get(0)).iterator(); Iterator<Tuple> bagScores = ((DataBag) p_input.get(1)).iterator(); int bestScore = -1; String bestReviewer = null; while (bagReviewers.hasNext() && bagScores.hasNext()) { String reviewerName = (String) bagReviewers.next().get(0); Integer score = (Integer) bagScores.next().get(0); if (score.intValue() > bestScore) { bestScore = score; bestReviewer = reviewerName; } } return TupleFactory.getInstance().newTuple( Arrays.asList(bestReviewer, (Integer) bestScore)); }
  • 29.
    BestBook EvalFunc (same code as the previous slide)
    • The inputs are bag “columns”: p_input.get(0) and p_input.get(1) are each cast to a DataBag
  • 30.
    BestBook EvalFunc (same code again)
    • Return a Tuple that’s just like the inputs
  • 31.
    BestBook EvalFunc

    public class BestBook extends EvalFunc<Tuple> {

        @Override
        public Schema outputSchema(Schema p_input) {
            try {
                return Schema.generateNestedSchema(DataType.TUPLE,
                        DataType.CHARARRAY, DataType.INTEGER);
            } catch (FrontendException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    How to define the outbound schema inside the Tuple
  • 32.
    BestBook: returns the highest scored book

    REGISTER target/demo-pig-udf-1.0-SNAPSHOT.jar;
    A = LOAD '$dir/bookscores' as (name : chararray, reviewer : chararray, score : int);
    B = group A by name;
    describe B;
    dump B;
    C = FOREACH B GENERATE group, BestBook(A.reviewer, A.score) as reviewandscore;
    describe C;
    dump C;
  • 33.
    BestBook: returns the highest scored book

    B: {group: chararray,A: {name: chararray,reviewer: chararray,score: int}}
    (book1,{(book1,aaa,1),(book1,bbb,3),(book1,ccc,12)})
    (book2,{(book2,aaa,4),(book2,bbb,1)})
    (book3,{(book3,ccc,1),(book3,bbb,5)})

    C: {group: chararray,reviewandscore: (chararray,int)}
    (book1,(ccc,12))
    (book2,(aaa,4))
    (book3,(bbb,5))
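The selection loop in BestBook can be tried without a Pig install. What follows is only a sketch: the class name `BestBookSketch` is hypothetical, and plain Java lists stand in for the two DataBag “columns”, so it checks the logic rather than the Pig plumbing.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-alone sketch: plain lists replace Pig's DataBag columns.
final class BestBookSketch {

    // Walks the two "columns" in step and keeps the highest-scoring reviewer,
    // mirroring the loop in BestBook.exec(). Returns {reviewer, score}.
    static Object[] best(List<String> reviewers, List<Integer> scores) {
        Iterator<String> r = reviewers.iterator();
        Iterator<Integer> s = scores.iterator();
        int bestScore = -1;          // same sentinel as the slide: scores assumed non-negative
        String bestReviewer = null;
        while (r.hasNext() && s.hasNext()) {
            String reviewer = r.next();
            int score = s.next();
            if (score > bestScore) {
                bestScore = score;
                bestReviewer = reviewer;
            }
        }
        return new Object[] { bestReviewer, bestScore };
    }

    public static void main(String[] args) {
        // book1 rows from the bookscores sample file
        Object[] book1 = best(Arrays.asList("aaa", "bbb", "ccc"), Arrays.asList(1, 3, 12));
        System.out.println(book1[0] + "," + book1[1]); // prints "ccc,12"
    }
}
```

Note the -1 starting value: a bag containing only negative scores would come back as (null, -1), which is worth knowing before reusing the pattern on other data.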
  • 34.
    BestBook: improve by implementing Algebraic
    • If an EvalFunc can be run in stages and summed up, consider implementing Algebraic
    • Three methods to override:
      • String getInitial();
      • String getIntermed();
      • String getFinal()
    • See COUNT and DoubleAvg
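To make “run in stages and summed up” concrete, here is a rough sketch of the idea applied to the best-score problem. It uses no Pig classes; `StagedBestScore`, `Best`, and the three static methods are hypothetical names for illustration. In Pig itself the three Algebraic methods return the class names of EvalFuncs that do this work, which is not shown here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the staged idea behind Algebraic, without any Pig classes.
final class StagedBestScore {

    // A partial result: the best (reviewer, score) pair seen so far.
    static final class Best {
        final String reviewer;
        final int score;
        Best(String reviewer, int score) {
            this.reviewer = reviewer;
            this.score = score;
        }
    }

    // Initial stage: turn one input record into a partial result.
    static Best initial(String reviewer, int score) {
        return new Best(reviewer, score);
    }

    // Intermediate stage: merge several partial results into one.
    // Taking a maximum is associative, so this can run repeatedly on any grouping.
    static Best intermediate(List<Best> partials) {
        Best best = null;
        for (Best p : partials) {
            if (best == null || p.score > best.score) {
                best = p;
            }
        }
        return best;
    }

    // Final stage: merge whatever partials are left and produce the answer.
    static Best finalStage(List<Best> partials) {
        return intermediate(partials);
    }

    public static void main(String[] args) {
        // Pretend the book1 rows were split across two chunks of work.
        Best chunk1 = intermediate(Arrays.asList(initial("aaa", 1), initial("bbb", 3)));
        Best chunk2 = intermediate(Arrays.asList(initial("ccc", 12)));
        Best result = finalStage(Arrays.asList(chunk1, chunk2));
        System.out.println(result.reviewer + "," + result.score); // prints "ccc,12"
    }
}
```

The point of the split is that each chunk only ships one small partial result onward instead of the whole bag.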
  • 35.
    FilterFunc: a filter that’s an EvalFunc
    • For keeping and discarding entries, write a filter
    • FilterFunc extends EvalFunc<Boolean>
    • Adds a method “void finish()” for cleanup
    • Example: only want rows whose two dates are within 10 minutes of one another
  • 36.
    FilterFunc: DateWithinFilter

    public class DateWithinFilter extends FilterFunc {

        @Override
        public Boolean exec(Tuple input) throws IOException {
            if (input.size() != 3) {
                throw new IOException("error msg");
            }
            Date[] startAndTryDates = getColumnDates(input);
            if (startAndTryDates == null)
                return false;
            long dateDiff = startAndTryDates[1].getTime() - startAndTryDates[0].getTime();
            if (dateDiff < 0) {
                return false; // maybe make optional
            }
            // third argument is the maximum allowed difference in milliseconds
            int maxDateDiff = (Integer) input.get(2);
            return dateDiff <= maxDateDiff;
        }
  • 37.
    FilterFunc: DateWithinFilter

        // df is a date format field of the class, declared off-slide
        private Date[] getColumnDates(Tuple input) throws ExecException {
            String strDate1 = (String) input.get(0);
            String strDate2 = (String) input.get(1);
            if (strDate1 == null || strDate2 == null) {
                return null;
            }
            Date date1 = null;
            try {
                date1 = df.parse(strDate1);
            } catch (ParseException e) {
                warn("date format err", PigWarning.UDF_WARNING_1);
                return null;
            }
            Date date2 = null;
            try {
                date2 = df.parse(strDate2);
            } catch (ParseException e) {
                warn("date format err", PigWarning.UDF_WARNING_1);
                return null;
            }
            return new Date[] { date1, date2 };
        }
  • 38.
    FilterFunc: DateWithinFilter

        @Override
        public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
            List<FuncSpec> funcList = new ArrayList<FuncSpec>();
            Schema s = new Schema();
            s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
            s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
            s.add(new Schema.FieldSchema(null, DataType.INTEGER));
            funcList.add(new FuncSpec(this.getClass().getName(), s));
            return funcList;
        }

    Defining what inputs we accept; stay tuned for what happens when violated
  • 39.
    FilterFunc: DateWithinFilter

    $ cat src/test/resources/purchasetimes
    1234 2010-06-01 10:31:22 2010-06-01 10:32:22
    7121 2010-06-01 10:30:18 2010-06-01 11:02:59
    1234 2010-06-01 10:40:18 2010-06-01 10:45:32
    7681 lol wut
    4532 2010-06-01 11:37:18 2010-06-01 11:42:59

    $ cat src/test/resources/purchasetimes.pig
    REGISTER target/demo-pig-udf-1.0-SNAPSHOT.jar;
    purchasetimes = LOAD '$dir/purchasetimes' AS (userid: int, datein: chararray, dateout: chararray);
    quickybuyers = FILTER purchasetimes BY DateWithinFilter(datein, dateout, 600000);
    DUMP quickybuyers;

    $ pig -x local -f src/test/resources/purchasetimes.pig -param dir=src/test/resources/
    (1234,2010-06-01 10:31:22,2010-06-01 10:32:22)
    (1234,2010-06-01 10:40:18,2010-06-01 10:45:32)
    (4532,2010-06-01 11:37:18,2010-06-01 11:42:59)
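The keep/discard decision made by DateWithinFilter can be checked against the sample rows in plain Java. This is a sketch under assumptions: `DateWithinSketch` is a hypothetical name, and the date pattern `yyyy-MM-dd HH:mm:ss` is inferred from the sample data rather than taken from the filter’s source.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

// Hypothetical stand-alone version of the filter's decision, without Pig classes.
final class DateWithinSketch {

    // Assumed pattern, inferred from the purchasetimes sample rows.
    private static final String FORMAT = "yyyy-MM-dd HH:mm:ss";

    // True when end is at or after start and no more than maxMillis later.
    // Unparseable or missing dates are rejected, like the null return in getColumnDates().
    static boolean within(String start, String end, long maxMillis) {
        if (start == null || end == null) {
            return false;
        }
        SimpleDateFormat df = new SimpleDateFormat(FORMAT);
        df.setLenient(false);
        try {
            long diff = df.parse(end).getTime() - df.parse(start).getTime();
            return diff >= 0 && diff <= maxMillis;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        long tenMinutes = 600000L;
        System.out.println(within("2010-06-01 10:31:22", "2010-06-01 10:32:22", tenMinutes)); // true
        System.out.println(within("2010-06-01 10:30:18", "2010-06-01 11:02:59", tenMinutes)); // false
        System.out.println(within("lol", "wut", tenMinutes));                                 // false
    }
}
```

A new SimpleDateFormat is created per call because the class is not thread-safe; a UDF that runs in a single thread could hold one in a field instead.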
  • 40.
    EvalFunc: not passing in the correct number of args

    $ cat src/test/resources/purchasetimes.pig
    quickybuyers = FILTER purchasetimes BY DateWithinFilter(datein, dateout);

    $ pig -x local -f src/test/resources/purchasetimes.pig -param dir=src/test/resources/
    2010-06-17 17:25:43,440 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: Could not infer the matching function for org.seattlehadoop.demo.pig.udf.DateWithinFilter as multiple or none of them fit. Please use an explicit cast.
    Details at logfile: /Users/cwilkes/Documents/workspace5/SeattleHadoop-demo-code/pig_1276820742917.log

    log file has:
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1197)

    so the error is caught before loading data
  • 41.
    Agenda 1 What, Why, How 2 EvalFunc basics 3 More EvalFunc 4 LoadFunc 5 Piggybank
  • 42.
    LoadFunc: definition
    • How does something get loaded into Pig?
      • A = load 'B';
    • But what is actually going on?
      • A = load 'B' using PigStorage();
    • PigStorage is a LoadFunc that reads off of disk and splits each line on tab to create a Tuple
  • 43.
    LoadFunc: definition
    • LoadFunc is an interface with a number of methods, the most interesting being
      • bindTo(fileName,inputStream,offset,end)
      • Tuple getNext()
    • Extend from UTF8StorageConverter like PigStorage to get defaults
    • Overview: PigStorage’s getNext() creates an array of objects after splitting on a tab and puts those into a Tuple
  • 44.
    LoadFunc: make our own
    • Have a lot of log files, some just contain a URL
      • http://example.com?use=mind+bullets&target=yak
    • Want to load URLs and do analysis
    • Write your own LoadFunc to do this that takes a URL and returns a Map of the query parameters
    • Know what parameters you care about, only look for those
  • 45.
    LoadFunc: make our own (cont)
    • Goal:
      • A = LOAD 'urls' USING QuerystringLoader('query', 'userid') AS (query: chararray, userid : int);
  • 46.
    LoadFunc: QuerystringLoader
    • Passing in constructor arguments from the pig script is easy:
      • public QuerystringLoader(String... fieldNames)
    • bindTo is almost exactly the same as the PigStorage one, using the PigLineRecordReader to parse the InputStream
    • Tuple getNext() is where the action happens
      • parse the querystring into a Map
      • loop through the fields given in the constructor
      • return a Tuple of a list of those objects
  • 47.
    LoadFunc: QuerystringLoader getNext()

    @Override
    public Tuple getNext() throws IOException {
        if (in == null || in.getPosition() > end) {
            return null;
        }
        Text value = new Text();
        boolean notDone = in.next(value);
        if (!notDone) {
            return null;
        }
        Map<String, Object> parameters = getParameterMap(value.toString());
        List<String> output = new ArrayList<String>();
        for (String fieldName : m_fieldsInOrder) {
            Object object = parameters.get(fieldName);
            if (object == null) {
                output.add(null);
                continue;
            }
            if (object instanceof String) {
                output.add((String) object);
            } else {
                // repeated parameter: keep only the first value
                List<String> objectVal = (List<String>) object;
                output.add(objectVal.get(0));
            }
        }
        return mTupleFactory.newTupleNoCopy(output);
    }
  • 48.
    LoadFunc: notes
    • boolean okay = in.next(value) is how to get the next parsed line
    • getParameterMap(url) splits the querystring into a Map<String,Object>
    • Pig handles type conversion for you, just hand back a Tuple
    • In this case the Tuple can be made up of anything, so the user specifies the schema in the script
      • AS (query:chararray, userid:int)
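The slides do not show getParameterMap itself, so the following is only a guess at its shape, written to fit how getNext() consumes the map: a parameter seen once maps to a String, and a repeated parameter maps to a List<String>. The class name `QuerystringSketch` is hypothetical, and URL decoding (for example turning `mind+bullets` into `mind bullets`) is deliberately left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the loader's getParameterMap(); not the author's code.
final class QuerystringSketch {

    // Splits the query string of a URL into a map. A key seen once maps to a String;
    // a repeated key maps to a List<String> of its values in order of appearance.
    // Values are NOT URL-decoded in this sketch.
    @SuppressWarnings("unchecked")
    static Map<String, Object> getParameterMap(String url) {
        Map<String, Object> params = new HashMap<String, Object>();
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return params; // no query string at all
        }
        for (String pair : url.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip pairs with no key or no '='
            }
            String key = pair.substring(0, eq);
            String value = pair.substring(eq + 1);
            Object existing = params.get(key);
            if (existing == null) {
                params.put(key, value);
            } else if (existing instanceof String) {
                List<String> values = new ArrayList<String>();
                values.add((String) existing);
                values.add(value);
                params.put(key, values);
            } else {
                ((List<String>) existing).add(value);
            }
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, Object> m = getParameterMap("http://example.com?use=mind+bullets&target=yak");
        System.out.println(m.get("use"));    // prints "mind+bullets"
        System.out.println(m.get("target")); // prints "yak"
    }
}
```

A line with no `?` yields an empty map, so every requested field comes back null, which matches the all-null tuple a loader would emit for a junk line.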
  • 49.
    RegexLoader

    Same concept: pass in a Pattern for the constructor and have getNext() return only the matched parts

    @Override
    public Tuple getNext() throws IOException {
        // value holds the current line; the code that reads it is omitted on the slide
        Matcher m = m_linePattern.matcher(value.toString());
        if (!m.matches()) {
            return EmptyTuple.getInstance();
        }
        List<String> regexMatches = new ArrayList<String>();
        for (int i = 1; i <= m.groupCount(); i++) {
            regexMatches.add(m.group(i));
        }
        return mTupleFactory.newTupleNoCopy(regexMatches);
    }
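The group-extraction loop is ordinary java.util.regex and can be tried on its own. In this sketch the class name `RegexGroupsSketch` and the example pattern are made up for illustration, and an empty list stands in for the loader’s empty tuple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone version of the RegexLoader matching loop.
final class RegexGroupsSketch {

    // Returns the capture groups of linePattern when it matches the whole line,
    // or an empty list when it does not match.
    static List<String> groups(Pattern linePattern, String line) {
        List<String> regexMatches = new ArrayList<String>();
        Matcher m = linePattern.matcher(line);
        if (!m.matches()) {
            return regexMatches;
        }
        // Group 0 is the whole match, so the capture groups start at 1.
        for (int i = 1; i <= m.groupCount(); i++) {
            regexMatches.add(m.group(i));
        }
        return regexMatches;
    }

    public static void main(String[] args) {
        // Example pattern (an assumption): a numeric id followed by two tokens.
        Pattern p = Pattern.compile("(\\d+)\\s+(\\S+)\\s+(\\S+)");
        System.out.println(groups(p, "7681 lol wut")); // prints "[7681, lol, wut]"
        System.out.println(groups(p, "not a match"));  // prints "[]"
    }
}
```

Because matches() requires the pattern to cover the entire line, a pattern meant to pick fields out of the middle of a line needs leading and trailing `.*`.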
  • 50.
    Agenda 1 What, Why, How 2 EvalFunc basics 3 More EvalFunc 4 LoadFunc 5 Piggybank
  • 51.
    Piggybank
    • SVN repository of common UDFs
    • Excited about it at first, but it doesn’t appear to be used that much
    • There needs to be an easier way of doing this
    • CPAN (Perl) for Pig would be great
      • register pigpan://Math::FFT
      • brings down the jars from a maven-like repository and tells pig where to load from
      • any takers? Looking into it
  • 52.
    Bonus section: unit testing

    @Test
    public void testRepeatQueryParams() throws IOException {
        // three input lines: a repeated parameter, a line with no URL, a normal URL
        String url = "http://localhost/foo?a=123&a=456\nx=y\nhttp://localhost/bar?a=761&b=hi";
        QuerystringLoader loader = new QuerystringLoader("a", "b");
        InputStream in = new ByteArrayInputStream(url.getBytes());
        loader.bindTo(null, new BufferedPositionedInputStream(in), 0, url.length());

        Tuple tuple = loader.getNext();
        assertEquals("123", (String) tuple.get(0));
        assertNull(tuple.get(1));

        tuple = loader.getNext();
        assertEquals(2, tuple.size());
        assertNull(tuple.get(0));
        assertNull(tuple.get(1));

        tuple = loader.getNext();
        assertEquals("761", (String) tuple.get(0));
        assertEquals("hi", (String) tuple.get(1));
    }
  • 53.
    Resources
    UDF reference: http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html
    Code samples: http://github.com/seattlehadoop
    Presentation: http://www.slideshare.net/seattlehadoop