Apache PIG - User Defined Functions

12,558 views

Extending Pig to solve complex tasks

Published in: Education
  1. Apache Pig UDFs. Extending Pig to solve complex tasks. UDF = User Defined Function
  2. Your speaker today: Christoph Bauer
      ● Java developer for 10+ years
      ● one of the founders
      ● helping our clients to use and understand their (big) data
      ● working in "BigData" since 2010
  3. Why use PIG
      ● ad-hoc way of creating and executing map/reduce jobs
      ● simple, high-level language
      ● more natural for analysts than map/reduce
  4. Done. http://leesfishandphotos.blogspot.de
  5. Oh, wait...
  6. UDFs to the rescue. Writing user defined functions (UDFs):
      + easy to use
      + easy to code
      + keep the power of PIG
      + you can write them in Java, Python, ...
  7. Do whatever you want
      ● image feature extraction
      ● geo computations
      ● data cleaning
      ● retrieve web pages
      ● natural language processing ...
      ● much more...
  8. User Defined Functions
      ● EvalFunc<T>
          public T exec(Tuple input)
      ● FilterFunc
          public Boolean exec(Tuple input)
      ● Aggregate Functions
          public interface Algebraic {
              public String getInitial();
              public String getIntermed();
              public String getFinal();
          }
      ● Load/Store Functions
          public Tuple getNext()
          public void putNext(Tuple input);
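The three Algebraic stages listed above can be pictured with a plain-Java sketch. It does not use the Pig API: the class `AlgebraicCountSketch` and its methods are hypothetical stand-ins, and a simple count is assumed as the aggregate. The idea is that the initial stage turns each record into a partial result, the intermediate stage merges partial results (for example in a combiner), and the final stage merges what is left.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the three stages of an algebraic aggregate (a count).
final class AlgebraicCountSketch {
    // Initial stage: every input record contributes a partial count of 1.
    static long initial(String record) {
        return 1L;
    }

    // Intermediate stage: merge partial counts into a larger partial count.
    static long intermed(List<Long> partials) {
        long sum = 0L;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Final stage: merge the remaining partial counts into the result.
    static long fin(List<Long> partials) {
        return intermed(partials);
    }

    public static void main(String[] args) {
        // Two "mappers" see two records and one record respectively.
        long a = intermed(Arrays.asList(initial("r1"), initial("r2")));
        long b = intermed(Arrays.asList(initial("r3")));
        System.out.println(fin(Arrays.asList(a, b))); // prints 3
    }
}
```

Because partial results are merged early, far less data has to travel between stages than if every record were sent to a single place.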
  9. What? Why? The stored data holds one companyName, several versions of companyAddress and many versions of Net Worth. The report should contain one row per year:
      2010 | companyName | current Address | historical Net Worth
      2011 | companyName | current Address | historical Net Worth
      2012 | companyName | current Address | historical Net Worth
  10. Example
      Input:
          r1, { q1:[(t1, "v1"), (t4, "v2")], q2:[(t2, "v3"), (t7, "v4")] }
      ... apply UDF: SNAPSHOTS(q1, t1 <= t < t6, 2), LATEST(q2)
      Output:
          r1, t1, q1:"v1", q2:"v4"
          r1, t3, q1:"v1", q2:"v4"
          r1, t5, q1:"v2", q2:"v4"
  11. LATEST
      public class LATEST extends EvalFunc<Tuple> {
          public Tuple exec(Tuple input) throws IOException {
          }
      }
  12. LATEST (contd.)
      Example input: r1, { q1:[(t1, "v1"), (t4, "v2")], q2:[(t2, "v3"), (t7, "v4")] }
      public Tuple exec(Tuple input) throws IOException {
          int numTuples = input.size();
          Tuple result = tupleFactory.newTuple(numTuples);
          for (int i = 0; i < numTuples; i++) {
              switch (input.getType(i)) {
              case DataType.BAG:
                  DataBag bag = (DataBag) input.get(i);
                  Object val = extractLatestValueFromBag(bag);
                  if (val != null) {
                      result.set(i, val);
                  }
                  break;
              case DataType.MAP:
                  // ... MAPs need different handling
              default:
                  // warn ...
              }
          }
          return result;
      }
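The slides do not show `extractLatestValueFromBag`. A minimal sketch of what such a helper might do, assuming each bag entry is a (timestamp, value) pair and "latest" means the highest timestamp, is given below. It uses plain Java instead of Pig's `DataBag`; the `Version` and `LatestSketch` classes are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one (timestamp, value) entry of a versioned column.
final class Version {
    final long ts;
    final String value;

    Version(long ts, String value) {
        this.ts = ts;
        this.value = value;
    }
}

final class LatestSketch {
    // Returns the value with the highest timestamp, or null for an empty list.
    static String latest(List<Version> versions) {
        Version best = null;
        for (Version v : versions) {
            if (best == null || v.ts > best.ts) {
                best = v;
            }
        }
        return best == null ? null : best.value;
    }

    public static void main(String[] args) {
        // q2 from the example slide: [(t2, "v3"), (t7, "v4")]
        List<Version> q2 = Arrays.asList(new Version(2L, "v3"), new Version(7L, "v4"));
        System.out.println(latest(q2)); // prints v4
    }
}
```

Returning null for an empty list matches the null check in `exec`, which leaves the result field unset in that case.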
  13. SNAPSHOT
      public class SNAPSHOTS extends EvalFunc<DataBag> {
          @Override
          public DataBag exec(Tuple input) throws IOException {
              List<Tuple> listOfTuples = new ArrayList<Tuple>();
              DateTime dtCur = new DateTime(start);
              // plus(1L) adds one millisecond so that the end instant is included
              DateTime dtEnd = new DateTime(end).plus(1L);
              while (dtCur.isBefore(dtEnd)) {
                  // snapshot() expects the timestamp as a long
                  listOfTuples.add(snapshot(input, dtCur.getMillis()));
                  dtCur = dtCur.plus(period);
              }
              DataBag bag = factory.newDefaultBag(listOfTuples);
              return bag;
          }
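The loop in `exec` walks from the start instant to the end instant in fixed steps. The same stepping logic can be sketched without Joda-Time or Pig, assuming timestamps and the period are plain `long` values and the end is inclusive; `SnapshotTimesSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class SnapshotTimesSketch {
    // Lists every snapshot timestamp from start to end (inclusive) in steps of period.
    static List<Long> snapshotTimes(long start, long end, long period) {
        if (period <= 0) {
            throw new IllegalArgumentException("period must be positive");
        }
        List<Long> times = new ArrayList<>();
        for (long t = start; t <= end; t += period) {
            times.add(t);
        }
        return times;
    }

    public static void main(String[] args) {
        // The example slide uses t1 <= t < t6 with step 2, i.e. t1, t3, t5.
        System.out.println(snapshotTimes(1L, 5L, 2L)); // prints [1, 3, 5]
    }
}
```

The guard against a non-positive period matters in practice: without it a misconfigured step would make the loop run forever.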
  14. SNAPSHOT (contd.)
      Example input: r1, { q1:[(t1, "v1"), (t4, "v2")], q2:[(t2, "v3"), (t7, "v4")] }
      protected Tuple snapshot(Tuple input, long ts) throws... {
          int numTuples = input.size();
          Tuple result = tupleFactory.newTuple(numTuples + 1);
          result.set(0, ts);
          for (int i = 0; i < numTuples; i++) {
              switch (input.getType(i)) {
              case DataType.BAG:
                  DataBag bag = (DataBag) input.get(i);
                  Object val = extractTSValueFromBag(bag, ts);
                  result.set(i + 1, val);
                  break;
              case DataType.MAP:
                  // handle MAPs
              default:
              }
          }
          return result;
      }
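`extractTSValueFromBag` is not shown in the slides. Judging from the example output (q1 is "v1" at t1 and t3, "v2" at t5), it presumably returns the value that was current at the given timestamp, that is, the version with the greatest timestamp not after `ts`. A plain-Java sketch under that assumption, with a `TreeMap` standing in for the bag and a hypothetical class name:

```java
import java.util.Map;
import java.util.TreeMap;

final class SnapshotValueSketch {
    // Returns the value whose timestamp is the greatest one <= ts,
    // or null if no version existed yet at ts.
    static String valueAt(TreeMap<Long, String> versions, long ts) {
        Map.Entry<Long, String> entry = versions.floorEntry(ts);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        // q1 from the example slide: [(t1, "v1"), (t4, "v2")]
        TreeMap<Long, String> q1 = new TreeMap<>();
        q1.put(1L, "v1");
        q1.put(4L, "v2");
        System.out.println(valueAt(q1, 1L)); // prints v1
        System.out.println(valueAt(q1, 3L)); // prints v1
        System.out.println(valueAt(q1, 5L)); // prints v2
    }
}
```

A sorted map keeps each lookup logarithmic in the number of versions, which helps when one row is sampled at many snapshot times.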
  15. PigLatin
      Example input: r1, { q1:[(t1, "v1"), (t4, "v2")], q2:[(t2, "v3"), (t7, "v4")] }
      REGISTER my-udf.jar
      DEFINE LATEST myudf.Latest();
      DEFINE SNAPSHOT myudf.Snapshot('2000-01-01', '2013-01-01', '1y');
      A = LOAD 'inputTable' AS (id, q1, q2);
      B = FOREACH A GENERATE id, SNAPSHOT(q1) AS SN, LATEST(q2) AS CUR;
      C = FOREACH B GENERATE id, FLATTEN(SN), FLATTEN(CUR);
      STORE C INTO 'output.csv';
  16. Passing parameters to UDFs
      DEFINE SNAPSHOT cool.udf.Snapshot('2000-01-01', '2013-01-01', '1y');
      ...
      public SNAPSHOTS(String start, String end, String increment) {
          super();
          this.start = Long.parseLong(start);
          this.end = Long.parseLong(end);
          this.increment = Long.parseLong(increment);
      }
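Constructor arguments reach the UDF as strings. The constructor on the slide calls `Long.parseLong`, which would not accept a date string such as 2000-01-01 or a period such as 1y, so the real code presumably parses them differently. One possible way to do it with `java.time` is sketched below; the class name, the UTC assumption and the period format (a number followed by y, m, w or d) are all assumptions, not the author's implementation.

```java
import java.time.LocalDate;
import java.time.Period;
import java.time.ZoneOffset;
import java.util.Locale;

final class SnapshotParamsSketch {
    final long startMillis;
    final long endMillis;
    final Period increment;

    // All three arguments arrive as strings, as they would from a DEFINE statement.
    SnapshotParamsSketch(String start, String end, String increment) {
        this.startMillis = toEpochMillis(start);
        this.endMillis = toEpochMillis(end);
        // "1y" becomes the ISO-8601 period "P1Y".
        this.increment = Period.parse("P" + increment.toUpperCase(Locale.ROOT));
    }

    // Interprets an ISO date (yyyy-MM-dd) as midnight UTC.
    private static long toEpochMillis(String isoDate) {
        return LocalDate.parse(isoDate)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }

    public static void main(String[] args) {
        SnapshotParamsSketch p = new SnapshotParamsSketch("2000-01-01", "2013-01-01", "1y");
        System.out.println(p.startMillis + " " + p.endMillis + " " + p.increment);
    }
}
```

Parsing in the constructor means a malformed argument fails once, when the UDF is instantiated, rather than on every call to `exec`.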
  17. I didn't talk about
      ● UDFs run as a single instance in every mapper, reducer, ...; use instance variables for locally shared objects
      ● Watch your heap when using Lucene indexes or implementing the Algebraic interface
      ● Do implement public Schema outputSchema(Schema input)
      ● Report progress when doing time-consuming stuff
      ● Performance?
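The first point, using instance variables for locally shared objects, can be illustrated with a plain-Java sketch. It assumes that one instance handles many calls within a task, as the slide states; the class, the lookup table and the `buildCount` field are hypothetical and only show that the expensive object is built once.

```java
import java.util.HashMap;
import java.util.Map;

final class SharedStateSketch {
    // Built lazily on the first call and then reused by every later call.
    private Map<String, String> lookup;
    int buildCount = 0; // only here to make the single initialisation visible

    String exec(String key) {
        if (lookup == null) {
            lookup = buildLookup();
        }
        return lookup.get(key);
    }

    // Stand-in for an expensive setup step such as loading an index.
    private Map<String, String> buildLookup() {
        buildCount++;
        Map<String, String> m = new HashMap<>();
        m.put("de", "Germany");
        m.put("fr", "France");
        return m;
    }

    public static void main(String[] args) {
        SharedStateSketch udf = new SharedStateSketch();
        System.out.println(udf.exec("de")); // prints Germany
        System.out.println(udf.exec("fr")); // prints France
        System.out.println(udf.buildCount); // prints 1
    }
}
```

Anything cached this way stays on the heap for the lifetime of the task, which is why the slide also warns about memory use.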
  18. SNAPSHOT (contd.)
      @Override
      public Schema outputSchema(Schema input) {
          List<FieldSchema> out = new ArrayList<FieldSchema>();
          // first field of every snapshot tuple is the snapshot timestamp
          out.add(new FieldSchema("snapshot", DataType.LONG));
          // the remaining fields keep the alias and type of the input fields
          for (FieldSchema fieldSchema : input.getFields()) {
              String alias = fieldSchema.alias;
              byte type = fieldSchema.type;
              out.add(new FieldSchema(alias, type));
          }
          Schema bagSchema = new Schema(out);
          try {
              return new Schema(new FieldSchema(
                  getSchemaName("snapshots", input), bagSchema, DataType.BAG));
          } catch (FrontendException e) {
          }
          return null;
      }
  19. Reality check
      ● These UDFs are in production
      ● Producing reports with up to 60GB
      ● Data is stored in HBase
  20. Wrapping it up. We at Oberbaum Concept developed a bunch of PIG functions handling versioned data in HBase.
      ● Rewrote HBaseStorage
      ● UDFs for Snapshots, Latest
      Right now we are trying to push our changes back into PIG.
  21. Questions?
  22. Thank you!
      Christoph Bauer
      christoph.bauer@oberbaum-concept.com
      https://www.xing.com/profile/Christoph_Bauer62