Apache PIG - User Defined Functions

Extending Pig to solve complex tasks

Published in: Education


  1. 1. Apache Pig UDFs
      Extending Pig to solve complex tasks
      UDF = User Defined Function
  2. 2. Your speaker today: Christoph Bauer
      Java developer for 10+ years
      one of the founders
      Helping our clients to use and understand their (big) data
      working in "BigData" since 2010
  3. 3. Why use PIG
      ● ad-hoc way of creating and executing map/reduce jobs
      ● simple, high-level language
      ● more natural for analysts than map/reduce
  4. 4. Done. http://leesfishandphotos.blogspot.de
  5. 5. Oh, wait...
  6. 6. UDFs to the rescue
      Writing user defined functions (UDFs):
      + easy to use
      + easy to code
      + keep the power of PIG
      + you can write them in Java, Python, ...
  7. 7. Do whatever you want
      ● image feature extraction
      ● geo computations
      ● data cleaning
      ● retrieve web pages
      ● natural language processing ...
      ● much more...
  8. 8. User Defined Functions
      ● EvalFunc<T>
          public T exec(Tuple input)
      ● FilterFunc
          public Boolean exec(Tuple input)
      ● Aggregate Functions
          public interface Algebraic {
              public String getInitial();
              public String getIntermed();
              public String getFinal();
          }
      ● Load/Store Functions
          public Tuple getNext()
          public void putNext(Tuple input)
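The three Algebraic steps on the slide above split an aggregate into an initial, an intermediate and a final stage, so that partial results can be combined before the final reduce. The following is only a plain-Java sketch of that idea for a simple count; it uses no Pig classes, and the class and method names (`AlgebraicCountSketch`, `initial`, `intermed`, `finalStep`) are hypothetical.

```java
import java.util.List;

// Hypothetical illustration only: plain Java, no Pig classes involved.
class AlgebraicCountSketch {

    // Initial step: turn one input record into a partial count.
    static long initial(String record) {
        return 1L;
    }

    // Intermediate step: merge partial counts; may be applied repeatedly.
    static long intermed(List<Long> partials) {
        long sum = 0L;
        for (long partial : partials) {
            sum += partial;
        }
        return sum;
    }

    // Final step: merge the remaining partial counts into the result.
    static long finalStep(List<Long> partials) {
        return intermed(partials);
    }
}
```

In Pig itself, `getInitial()`, `getIntermed()` and `getFinal()` return the class names of the functions that implement these stages, rather than doing the work directly.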
  9. 9. What? Why?
      [Figure: one companyName, three companyAddress versions and nine Net Worth versions]
      2010 | companyName | current Address | historical Net Worth
      2011 | companyName | current Address | historical Net Worth
      2012 | companyName | current Address | historical Net Worth
  10. 10. Example
      r1, { q1:[(t1, "v1"), (t4, "v2")], q2:[(t2, "v3"), (t7, "v4")] }
      ... apply UDF
      r1, t1, q1:"v1", q2:"v4"
      r1, t3, q1:"v1", q2:"v4"
      r1, t5, q1:"v2", q2:"v4"
      SNAPSHOTS(q1, t1 <= t < t6, 2), LATEST(q2)
  11. 11. LATEST
      public class LATEST extends EvalFunc<Tuple> {
          @Override
          public Tuple exec(Tuple input) throws IOException {
              // body shown on the next slide
          }
      }
  12. 12. LATEST (contd.)
      public Tuple exec(Tuple input) throws IOException {
          int numTuples = input.size();
          Tuple result = tupleFactory.newTuple(numTuples);
          for (int i = 0; i < numTuples; i++) {
              switch (input.getType(i)) {
              case DataType.BAG:
                  DataBag bag = (DataBag) input.get(i);
                  Object val = extractLatestValueFromBag(bag);
                  if (val != null) {
                      result.set(i, val);
                  }
                  break;
              case DataType.MAP:
                  // ... MAPs need different handling
              default:
                  // warn ...
              }
          }
          return result;
      }
      Example input: r1, { q1:[(t1, "v1"), (t4, "v2")], q2:[(t2, "v3"), (t7, "v4")] }
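The slide does not show `extractLatestValueFromBag`. As a rough sketch of what such a helper might do, the following plain-Java version picks the value with the greatest timestamp from a list of (timestamp, value) pairs. It is an assumption, not the speaker's code: the bag is modelled as a `List` of `Map.Entry` objects, and `LatestSketch` and `latestValue` are hypothetical names.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for extractLatestValueFromBag, without Pig types.
class LatestSketch {

    // Returns the value with the greatest timestamp, or null for an empty list.
    static <V> V latestValue(List<Map.Entry<Long, V>> versions) {
        Map.Entry<Long, V> latest = null;
        for (Map.Entry<Long, V> version : versions) {
            if (latest == null || version.getKey() > latest.getKey()) {
                latest = version;
            }
        }
        return latest == null ? null : latest.getValue();
    }
}
```

Returning null for an empty input matches the null check in `exec` above, which then leaves the corresponding result field unset.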
  13. 13. SNAPSHOT
      public class SNAPSHOTS extends EvalFunc<DataBag> {
          @Override
          public DataBag exec(Tuple input) throws IOException {
              List<Tuple> listOfTuples = new ArrayList<Tuple>();
              DateTime dtCur = new DateTime(start);
              // add 1 ms so that the end timestamp itself is included
              DateTime dtEnd = new DateTime(end).plus(1L);
              while (dtCur.isBefore(dtEnd)) {
                  // snapshot() takes the timestamp as a long (see next slide)
                  listOfTuples.add(snapshot(input, dtCur.getMillis()));
                  dtCur = dtCur.plus(period);
              }
              DataBag bag = factory.newDefaultBag(listOfTuples);
              return bag;
          }
          // class continues on the next slide
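The loop above walks from the start to the end timestamp (inclusive) in steps of one period and takes one snapshot per step. A minimal sketch of just that stepping logic, using plain `long` timestamps instead of Joda-Time; `SnapshotTimesSketch` and `snapshotTimes` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the stepping logic only.
class SnapshotTimesSketch {

    // Returns start, start + period, ... up to and including end.
    static List<Long> snapshotTimes(long start, long end, long period) {
        if (period <= 0) {
            throw new IllegalArgumentException("period must be positive");
        }
        List<Long> times = new ArrayList<Long>();
        for (long t = start; t <= end; t += period) {
            times.add(t);
        }
        return times;
    }
}
```

With start 1, end 5 and period 2 this yields the timestamps 1, 3 and 5, which corresponds to the rows t1, t3 and t5 in the earlier example slide.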
  14. 14. SNAPSHOT (contd.)
      protected Tuple snapshot(Tuple input, long ts) throws... {
          int numTuples = input.size();
          Tuple result = tupleFactory.newTuple(numTuples + 1);
          result.set(0, ts);  // field 0 carries the snapshot timestamp
          for (int i = 0; i < numTuples; i++) {
              switch (input.getType(i)) {
              case DataType.BAG:
                  DataBag bag = (DataBag) input.get(i);
                  Object val = extractTSValueFromBag(bag, ts);
                  result.set(i + 1, val);
                  break;
              case DataType.MAP:
                  // handle MAPs
              default:
              }
          }
          return result;
      }
      Example input: r1, { q1:[(t1, "v1"), (t4, "v2")], q2:[(t2, "v3"), (t7, "v4")] }
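`extractTSValueFromBag` is not shown on the slides. Judging from the example output (q1 is "v1" at t3 and "v2" at t5), it presumably returns the value that was current at the given timestamp, i.e. the entry with the greatest timestamp not after it. A plain-Java sketch under that assumption, with the bag modelled as a `TreeMap` and the hypothetical names `ValueAtSketch` and `valueAt`:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for extractTSValueFromBag, without Pig types.
class ValueAtSketch {

    // Returns the value whose timestamp is the greatest one <= ts,
    // or null if every version is newer than ts.
    static <V> V valueAt(TreeMap<Long, V> versions, long ts) {
        Map.Entry<Long, V> entry = versions.floorEntry(ts);
        return entry == null ? null : entry.getValue();
    }
}
```

A sorted map keeps the lookup simple; a real bag is unordered, so an implementation working on a `DataBag` would have to scan or sort the tuples first.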
  15. 15. PigLatin
      Example input: r1, { q1:[(t1, "v1"), (t4, "v2")], q2:[(t2, "v3"), (t7, "v4")] }
      REGISTER my-udf.jar;
      DEFINE LATEST myudf.Latest();
      DEFINE SNAPSHOT myudf.Snapshot('2000-01-01', '2013-01-01', '1y');
      A = LOAD 'inputTable' AS (id, q1, q2);
      B = FOREACH A GENERATE id, SNAPSHOT(q1) AS SN, LATEST(q2) AS CUR;
      C = FOREACH B GENERATE id, FLATTEN(SN), FLATTEN(CUR);
      STORE C INTO 'output.csv';
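  16. 16. Passing parameters to UDFs
      DEFINE SNAPSHOT cool.udf.Snapshot('2000-01-01', '2013-01-01', '1y');
      ...
      // Pig passes the DEFINE arguments to the constructor as strings.
      public SNAPSHOTS(String start, String end, String increment) {
          super();
          // As written, Long.parseLong expects numeric strings; date strings
          // such as '2000-01-01' would have to be converted to timestamps first.
          this.start = Long.parseLong(start);
          this.end = Long.parseLong(end);
          this.increment = parseLong(increment);
      }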
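Since the constructor arguments arrive as strings, values like '2000-01-01' and '1y' need to be parsed before they can be used as timestamps and step sizes. The slides do not show how this is done, so the following is only one possible sketch using `java.time`; the class `SnapshotParamsSketch`, the choice of UTC, and the interpretation of '1y' as an ISO-8601 period are all assumptions.

```java
import java.time.LocalDate;
import java.time.Period;
import java.time.ZoneOffset;
import java.util.Locale;

// Hypothetical parameter holder; not part of Pig or of the original UDF.
class SnapshotParamsSketch {
    final long startMillis;
    final long endMillis;
    final Period increment;

    SnapshotParamsSketch(String start, String end, String increment) {
        this.startMillis = toEpochMillis(start);
        this.endMillis = toEpochMillis(end);
        // Assumption: "1y" is shorthand for the ISO-8601 period "P1Y".
        this.increment = Period.parse("P" + increment.toUpperCase(Locale.ROOT));
    }

    // Parses an ISO date such as "2000-01-01" as midnight UTC (an assumption).
    static long toEpochMillis(String isoDate) {
        return LocalDate.parse(isoDate)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }
}
```

Parsing once in the constructor keeps the per-tuple `exec` call free of string handling.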
  17. 17. I didn't talk about
      ● UDFs run as a single instance in every mapper, reducer, ...: use instance variables for locally shared objects
      ● Watch your heap when using Lucene indexes or implementing the Algebraic interface
      ● Do implement public Schema outputSchema(Schema input)
      ● Report progress when doing time-consuming stuff
      ● Performance?
  18. 18. SNAPSHOT (contd.)
      @Override
      public Schema outputSchema(Schema input) {
          List<FieldSchema> out = new ArrayList<FieldSchema>();
          // first field: the snapshot timestamp, followed by the input fields
          out.add(new FieldSchema("snapshot", DataType.LONG));
          for (FieldSchema fieldSchema : input.getFields()) {
              String alias = fieldSchema.alias;
              byte type = fieldSchema.type;
              out.add(new FieldSchema(alias, type));
          }
          Schema bagSchema = new Schema(out);
          try {
              return new Schema(new FieldSchema(
                  getSchemaName("snapshots", input), bagSchema, DataType.BAG));
          } catch (FrontendException e) {
              // ignored on the slide; falls through to return null
          }
          return null;
      }
  19. 19. Reality check
      ● These UDFs are in production
      ● Producing reports with up to 60GB
      ● Data is stored in HBase
  20. 20. Wrapping it up
      We at Oberbaum Concept developed a bunch of PIG functions handling versioned data in HBase.
      ● Rewrote HBaseStorage
      ● UDFs for Snapshots, Latest
      Right now we are trying to push our changes back into PIG.
  21. 21. Questions?
  22. 22. Thank you!
      Christoph Bauer
      christoph.bauer@oberbaum-concept.com
      https://www.xing.com/profile/Christoph_Bauer62
