Pig programming is fun

  1. Pig programming is more fun: New features in Pig
     Daniel Dai (@daijy), Thejas Nair (@thejasn)
     © Hortonworks Inc. 2011
  2. What is Apache Pig?
     • Pig Latin, a high level data processing language.
     • An engine that executes Pig Latin locally or on a Hadoop cluster.
     Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
     Architecting the Future of Big Data
  3. Pig-latin example
     • Query: Get the list of pages visited by users whose age is between 20 and 25 years.
         users = load 'users' as (name, age);
         users_20_to_25 = filter users by age >= 20 and age <= 25;
         page_views = load 'pages' as (user, url);
         page_views_u20_to_25 = join users_20_to_25 by name, page_views by user;
  4. Why Pig?
     • Faster development
       – Fewer lines of code
       – Don’t re-invent the wheel
     • Flexible
       – Metadata is optional
       – Extensible
       – Procedural programming
     Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/
  5. Before Pig 0.9
     [Diagram: p1.pig, p2.pig, p3.pig]
  6. With Pig macros
     [Diagram: p1.pig, p2.pig, p3.pig, macro1.pig, macro2.pig]
  7. With Pig macros
     [Diagram: p1.pig, p1.pig, rm_bots.pig, get_top.pig]
  8. Pig macro example
     • Page_views data: (user_name, url, timestamp, …)
     • Find top 5 users by page views
     • Find top 10 most visited pages.
  9. Pig Macro example
     Without macros:
         page_views = LOAD ..
         /* get top 5 users by page view */
         u_grp = GROUP .. by uname;
         u_count = FOREACH .. COUNT ..
         ord_u_count = ORDER u_count ..
         top_5_users = LIMIT ordered.. 5;
         DUMP top_5_users;
         /* get top 10 urls by page view */
         url_grp = GROUP .. by url;
         url_count = FOREACH .. COUNT .
         ord_url_count = ORDER url_count..
         top_10_urls = LIMIT ord_url.. 10;
         DUMP top_10_urls;
     With a macro:
         /* top x macro */
         DEFINE topCount (rel, col, topNum)
         RETURNS top_num_recs {
           grped = GROUP $rel by $col;
           cnt_grp = FOREACH ..COUNT($rel)..
           ord_cnt = ORDER .. by cnt;
           $top_num_recs = LIMIT.. $topNum;
         }
         -----------------------------------------
         page_views = LOAD ..
         /* get top 5 users by page view */
         top_5_users = topCount(page_views, uname, 5);
         DUMP top_5_users;
         …
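The logic that the topCount macro factors out (group by a column, count, order by count, keep the first N) can be sketched in plain Python. This is only an illustration of the data flow, not Pig code; the function name `top_count` and the sample records are made up for the example.

```python
from collections import Counter

def top_count(records, col, top_num):
    """Return the top_num most frequent values of field `col`, with their counts."""
    # GROUP ... BY col, then COUNT per group
    counts = Counter(rec[col] for rec in records)
    # ORDER BY count descending (ties broken by value so the result is deterministic)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # LIMIT top_num
    return ordered[:top_num]

# Hypothetical page_views data: (uname, url)
page_views = [
    {"uname": "alice", "url": "/home"},
    {"uname": "bob", "url": "/home"},
    {"uname": "alice", "url": "/about"},
    {"uname": "carol", "url": "/home"},
    {"uname": "alice", "url": "/home"},
    {"uname": "bob", "url": "/about"},
]

# The same helper serves both queries, which is the point of the macro.
top_users = top_count(page_views, "uname", 2)
top_urls = top_count(page_views, "url", 1)
```

As with the macro, only the relation, the column, and the limit change between the two calls.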
  10. Pig macro
      • Coming soon: piggybank with Pig macros
  11. Writing data flow program
      • Writing a complex data pipeline is an iterative process
      [Diagram: Load, Load, Transform, Join, Group, Transform, Filter]
  12. Writing data flow program
      [Diagram: Load, Load, Transform, Join, Group, Transform, Filter]
      No output! :(
  13. Writing data flow program
      • Debug!
        – Was join on wrong attributes?
        – Bug in transform?
        – Did filter drop everything?
      [Diagram: Load, Load, Transform, Join, Group, Transform, Filter]
  14. Common approaches to debug
      • Running on real (large) data
        – Inefficient, takes longer
      • Running on (small) samples
        – Empty results on join, selective filters
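Why small samples tend to produce empty join results can be shown with a short Python sketch. The data is hypothetical, and the two "samples" are simply non-overlapping slices standing in for independent 1% samples of each input, which rarely share keys.

```python
def join(left, right, key_left, key_right):
    """Inner join of two lists of dicts, in the spirit of Pig's JOIN ... BY."""
    index = {}
    for r in right:
        index.setdefault(r[key_right], []).append(r)
    return [(l, r) for l in left for r in index.get(l[key_left], [])]

# Hypothetical inputs: 1000 users and exactly one page view per user.
users = [{"name": "user%04d" % i, "age": 18 + i % 50} for i in range(1000)]
page_views = [{"user": "user%04d" % i, "url": "/page/%d" % i} for i in range(1000)]

# On the full data every user matches one page view.
full = join(users, page_views, "name", "user")

# Each input is sampled on its own, so the sampled keys do not line up.
users_sample = users[0:10]
views_sample = page_views[500:510]
sampled = join(users_sample, views_sample, "name", "user")
```

Here `full` has 1000 rows while `sampled` is empty, so every statement downstream of the join would show no output on the sample.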
  15. Pig illustrate command
      • Objective: show examples of the input/output of each statement that are
        – Realistic
        – Complete
        – Concise
        – Generated fast
      • Steps
        – Downstream: sample and process
        – Prune
        – Upstream: generate realistic missing classes of examples
        – Prune
  16. Illustrate command demo
  17. Pig relation-as-scalar
      • In Pig each statement alias is a relation
        – Relation is a set of records
      • Task: Get list of pages whose load time was more than average.
      • Steps
        1. Compute average load time
        2. Get list of pages whose load time is > average
  18. Pig relation-as-scalar
      • Step 1 is like
          .. = load ..
          .. = group ..
          al_rel = foreach .. AVG(ltime) as avg_ltime;
      • Step 2 looks like
          page_views = load 'pviews.txt' as (url, ltime, ..);
          slow_views = filter page_views by ltime > avg_ltime
  19. Pig relation-as-scalar
      • Getting results of step 1 (avg_ltime) into step 2
        – Join result of step 1 with the page_views relation, or
        – Write result into file, then use a UDF to read from file
      • Pig scalar feature now simplifies this:
          slow_views = filter page_views by ltime > al_rel.avg_ltime
        – Runtime exception if al_rel has more than one record.
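The behaviour of `al_rel.avg_ltime` can be mimicked in plain Python as a rough sketch. The helper `as_scalar` and the sample data are hypothetical; returning None for an empty relation is an assumption of this sketch, and only the more-than-one-record failure comes from the slide.

```python
def as_scalar(relation, field):
    """Project a one-record relation to a single value, like al_rel.avg_ltime."""
    if len(relation) > 1:
        # The slide states that Pig raises a runtime exception in this case.
        raise RuntimeError("scalar relation has more than one record")
    # Assumption of this sketch: an empty relation yields None.
    return relation[0][field] if relation else None

# Hypothetical page_views data: (url, ltime)
page_views = [
    {"url": "/a", "ltime": 100},
    {"url": "/b", "ltime": 300},
    {"url": "/c", "ltime": 200},
]

# Step 1: a relation holding exactly one record, the average load time.
total = sum(r["ltime"] for r in page_views)
al_rel = [{"avg_ltime": total / float(len(page_views))}]

# Step 2: use that relation as a scalar inside the filter.
avg = as_scalar(al_rel, "avg_ltime")
slow_views = [r for r in page_views if r["ltime"] > avg]
```

This avoids both workarounds listed above: no join against a one-row relation and no round trip through a file.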
  20. UDF in Scripting Language
      • Benefit
        – Use legacy code
        – Use libraries of the scripting language
        – Leverage Hadoop for non-Java programmers
      • Currently supported languages
        – Python
        – JavaScript
        – Ruby
      • Extensible interface
        – Minimum effort to support another language
  21. Writing a Jython UDF
      Write a Jython UDF
          @outputSchema("word:chararray")
          def concat(word):
              return word + word

          @outputSchemaFunction("squareSchema")
          def square(num):
              if num == None:
                  return None
              return ((num)*(num))

          def squareSchema(input):
              return input
      • Invoke Jython UDF when needed
          register 'util.py' using jython as util;
          B = foreach A generate util.square(i);
      • Type conversion
        – Simple type
        – Python Array <-> Pig Bag
        – Python Dict <-> Pig Map
        – Python Tuple <-> Pig Tuple
      • Convey schema to Pig
        – outputSchema
        – outputSchemaFunction
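Inside Pig the `outputSchema` and `outputSchemaFunction` decorators are supplied by the runtime, so the UDF file above does not run on its own. One way to exercise such functions outside Pig is to define stand-in decorators first. The two stand-ins below are an assumption of this sketch (they only record the schema as a function attribute) and are not Pig's implementation.

```python
def outputSchema(schema):
    """Stand-in decorator: record the schema string on the function."""
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap

def outputSchemaFunction(name):
    """Stand-in decorator: record the name of the schema function."""
    def wrap(func):
        func.outputSchemaFunction = name
        return func
    return wrap

@outputSchema("word:chararray")
def concat(word):
    return word + word

@outputSchemaFunction("squareSchema")
def square(num):
    # Pass nulls through unchanged, as the UDF on the slide does.
    if num is None:
        return None
    return num * num

def squareSchema(input):
    # The output schema is the same as the input schema.
    return input
```

With the stand-ins in place, `concat("ab")` and `square(3)` can be checked in an ordinary Python interpreter before the script is registered in Pig.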
  22. Use NLTK in Pig
      • Example
          register 'nltk_util.py' using jython as nltk;
          ……
          B = foreach A generate nltk.tokenize(sentence)
        nltk_util.py
          import nltk
          porter = nltk.PorterStemmer()

          @outputSchema("words:{(word:chararray)}")
          def tokenize(sentence):
              tokens = nltk.word_tokenize(sentence)
              words = [porter.stem(t) for t in tokens]
              return words
  23. Writing a Script Engine
      Writing a bridge UDF
          class JythonFunction extends EvalFunc<Object> {
              public Object exec(Tuple tuple) {
                  PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
                  PyObject result = function.__call__(params);
                  return JythonUtils.pythonToPig(result);
              }
              public Schema outputSchema(Schema input) {
                  PyObject outputSchemaDef = function.__findattr__("outputSchema".intern());
                  return Utils.getSchemaFromString(outputSchemaDef.toString());
              }
          }
  24. Writing a Script Engine
      Register scripting UDF
          register 'util.py' using jython as util;
      What happens in Pig
          class JythonScriptEngine extends ScriptEngine {
              public void registerFunctions(String path, String namespace, PigContext pigContext) {
                  PythonInterpreter pi = Interpreter.interpreter;
                  pi.execfile(path);
                  for (PyTuple item : pi.getLocals().items())
                      funcspec = new FuncSpec(JythonFunction.class.getCanonicalName() + "(" + path + "," + item.get(0) + ")");
                  pigContext.registerFunction(namespace + key, funcspec);
              }
          }
  25. Algebraic UDF in JRuby
          class Count < AlgebraicPigUdf
            output_schema Schema.long
            def initial t
              t.nil? ? 0 : 1
            end
            def intermed t
              return 0 if t.nil?
              t.flatten.inject(:+)
            end
            def final t
              intermed(t)
            end
          end
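The three phases of the algebraic Count UDF above can be traced with a small Python sketch. It is not the Pig API: the chunking stands in for data spread across map tasks, and the function names simply mirror the JRuby methods.

```python
def initial(t):
    """Per-record phase: 1 for a non-null value, 0 for null."""
    return 0 if t is None else 1

def intermed(partials):
    """Combine phase: add up partial counts."""
    return sum(partials)

def final(partials):
    """Final phase: for a count this is the same sum over the partial results."""
    return intermed(partials)

# Hypothetical input, split into three chunks as if handled by three map tasks.
records = ["a", None, "b", "c", "d", None, "e"]
chunks = [records[0:3], records[3:5], records[5:7]]

# Each chunk is reduced locally, then the partial counts are merged.
per_chunk = [intermed([initial(t) for t in chunk]) for chunk in chunks]
total = final(per_chunk)
```

Because the partial counts can be merged in any grouping, only one small number per chunk has to be passed on to the final phase instead of every record.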
  26. Pig Embedding
      • Embed Pig inside a scripting language
        – Python
        – JavaScript
      • Algorithms which cannot be completed in one Pig script
        – Iterative algorithms: PageRank, K-means, Neural Network, Apriori, etc.
        – Parallel execution: Random forest
        – Divide and Conquer
        – Branching
  27. Pig Embedding
          from org.apache.pig.scripting import Pig

          input = ":INPATH:/singlefile/studenttab10k"

          # Compile Pig script
          P = Pig.compile("""A = load '$in' as (name, age, gpa); store A into 'output';""")

          # Bind variables
          Q = P.bind({'in': input})

          # Launch Pig script
          result = Q.runSingle()
          if result.isSuccessful():
              print "Pig job PASSED"
          else:
              raise "Pig job FAILED"
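The iterative use case for embedding follows a driver-loop pattern: run one job, inspect its result, decide whether to run another. The sketch below shows only that control flow in plain Python. `run_iteration` is a hypothetical stand-in for binding and running a compiled Pig script once, and the tiny link graph is made up; no Pig API is called.

```python
def run_iteration(ranks, links, damping=0.85):
    """Stand-in for one Pig job: a single PageRank update on an in-memory graph."""
    n = len(ranks)
    new_ranks = dict((page, (1.0 - damping) / n) for page in ranks)
    for page, outs in links.items():
        share = damping * ranks[page] / len(outs)
        for target in outs:
            new_ranks[target] += share
    return new_ranks

# Hypothetical graph: page -> pages it links to (every page has outgoing links).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = dict((page, 1.0 / len(links)) for page in links)

# Driver loop: the host script decides after each run whether to go again.
delta = None
for iteration in range(1, 101):
    new_ranks = run_iteration(ranks, links)
    delta = max(abs(new_ranks[p] - ranks[p]) for p in ranks)
    ranks = new_ranks
    if delta < 1e-6:
        break
```

In an embedded script the body of the loop would bind the current iteration's input and output paths and launch the job, with the convergence test reading the job's output.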
  28. Pig Embedding
      • Running embedded Pig script
          pig sample.py
      • What happens within Pig?
        [Diagram labels: sample.py, Python Script, Jython, Pig Script, Pig]
  29. Nested Operator
      • Nested operator: an operator inside foreach
          B = group A by name;
          C = foreach B {
            C0 = limit A 10;
            generate C0;
          }
      • Before Pig 0.10, the supported nested operators were
        – DISTINCT, FILTER, LIMIT, and ORDER BY
      • New operators added in 0.10
        – CROSS, FOREACH
  30. Nested Cross/Foreach
          A = LOAD 'studenttab10k' as (name:chararray, age:int, gpa:double);
          B = LOAD 'votertab10k' as (name:chararray, age:int, registration, contributions:double);
          C = cogroup A by name, B by name;
          D = foreach C {
            C1 = filter A by gpa > 4;
            C2 = filter B by contributions > 500;
            C3 = cross C1, C2;
            C4 = foreach C3 generate CONCAT(CONCAT((chararray)gpa, '_'), (chararray)contributions);
            generate flatten(C4);
          }
          store D into 'output';
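What the nested block computes per group can be followed in a plain Python sketch. The records, the `cogroup` helper, and the gpa threshold of 3.5 are hypothetical choices for the illustration (the script above uses gpa > 4), and Python's number formatting may differ from Pig's chararray cast.

```python
from collections import defaultdict

def cogroup(a, b, key):
    """Group two relations by the same key: key -> (bag of a, bag of b)."""
    groups = defaultdict(lambda: ([], []))
    for rec in a:
        groups[rec[key]][0].append(rec)
    for rec in b:
        groups[rec[key]][1].append(rec)
    return groups

# Hypothetical student and voter records.
students = [
    {"name": "ann", "age": 20, "gpa": 3.9},
    {"name": "ann", "age": 21, "gpa": 3.2},
    {"name": "bob", "age": 22, "gpa": 3.8},
]
voters = [
    {"name": "ann", "contributions": 600.0},
    {"name": "ann", "contributions": 100.0},
    {"name": "bob", "contributions": 900.0},
    {"name": "bob", "contributions": 700.0},
]

result = []
for name, (a_bag, b_bag) in sorted(cogroup(students, voters, "name").items()):
    c1 = [r for r in a_bag if r["gpa"] > 3.5]             # nested FILTER on A
    c2 = [r for r in b_bag if r["contributions"] > 500]   # nested FILTER on B
    c3 = [(x, y) for x in c1 for y in c2]                 # nested CROSS
    c4 = ["%s_%s" % (x["gpa"], y["contributions"]) for x, y in c3]  # nested FOREACH
    result.extend(c4)                                     # FLATTEN
```

The cross product is taken within each name group only, which is what distinguishes the nested CROSS from a CROSS over the whole relations.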
  31. Misc Loaders
      • HBaseStorage
      • CassandraStorage
      • AvroStorage
      • JsonLoader/JsonStorage
  32. New operators to come
      • Will be available in Pig 0.11
        – RANK
          – A distributed RANK implementation for Pig
        – CUBE
