Pig programming is more fun: New features in Pig

In the last year, we added many new language features to Pig, and Pig programming is now much easier than before. With Pig macros, we can write functions for Pig and modularize Pig programs. Pig embedding allows us to embed Pig statements in Python and make use of Python's rich language features such as loops and branches. Java is no longer the only choice for writing Pig UDFs; we can also write UDFs in Python, JavaScript and Ruby. Nested foreach and cross give us ways to manipulate data that were not possible before. We also added a lot of syntactic sugar to simplify Pig syntax, for example direct syntax support for map, tuple and bag, and project-range expressions in foreach. We also revived support for the illustrate command to ease debugging. In this paper, I will give an overview of all these features and illustrate how to use them to program more efficiently in Pig. I will also give concrete examples to demonstrate how the Pig language evolves over time with these language improvements.

Published in: Technology, Business

Transcript of "Pig programming is more fun: New features in Pig"

  1. Pig programming is more fun: New features in Pig
     Daniel Dai (@daijy), Thejas Nair (@thejasn)
     © Hortonworks Inc. 2011
  2. What is Apache Pig?
     • Pig Latin, a high-level data processing language.
     • An engine that executes Pig Latin locally or on a Hadoop cluster.
     Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
     Architecting the Future of Big Data, © Hortonworks Inc. 2011
  3. Pig Latin example
     • Query: Get the list of web pages visited by users whose age is between 20 and 29 years.
         USERS = load 'users' as (uid, age);
         USERS_20s = filter USERS by age >= 20 and age <= 29;
         PVs = load 'pages' as (url, uid, timestamp);
         PVs_u20s = join USERS_20s by uid, PVs by uid;
  4. Why Pig?
     • Faster development
       – Fewer lines of code
       – Don't re-invent the wheel
     • Flexible
       – Metadata is optional
       – Extensible
       – Procedural programming
     Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/
  5. Before Pig 0.9
     [Diagram: p1.pig, p2.pig, p3.pig]
  6. With Pig macros
     [Diagram: p1.pig, p2.pig, p3.pig; macro1.pig, macro2.pig]
  7. With Pig macros
     [Diagram: p1.pig, p1.pig; rm_bots.pig, get_top.pig]
  8. Pig macro example
     • Page_views data: (url, timestamp, uname, …)
     • Find
       1. top 5 users (uname) by page views
       2. top 10 most visited urls
  9. Pig macro example
     Without a macro:
         page_views = LOAD ..
         /* get top 5 users by page view */
         u_grp = GROUP .. by uname;
         u_count = FOREACH .. COUNT ..
         ord_u_count = ORDER u_count ..
         top_5_users = LIMIT ordered.. 5;
         DUMP top_5_users;
         /* get top 10 urls by page view */
         url_grp = GROUP .. by url;
         url_count = FOREACH .. COUNT ..
         ord_url_count = ORDER url_count..
         top_10_urls = LIMIT ord_url.. 10;
         DUMP top_10_urls;
     With a macro:
         /* top x macro */
         DEFINE topCount (rel, col, topNum)
         RETURNS top_num_recs {
             grped = GROUP $rel by $col;
             cnt_grp = FOREACH ..COUNT($rel)..
             ord_cnt = ORDER .. by cnt;
             $top_num_recs = LIMIT.. $topNum;
         }
         -----------------------------------------
         page_views = LOAD ..
         /* get top 5 users by page view */
         top_5_users = topCount(page_views, uname, 5);
         …
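To make the dataflow inside the topCount macro concrete, here is a plain-Python sketch of the same group, count, order and limit steps. It is only an illustration of the logic, not how Pig expands or executes a macro; the function name `top_count` and the sample records are hypothetical.

```python
from collections import Counter


def top_count(records, col, top_num):
    """Group records by `col`, count each group, keep the largest counts.

    Mirrors the GROUP / COUNT / ORDER / LIMIT steps of the slide's macro.
    """
    counts = Counter(rec[col] for rec in records)  # GROUP + COUNT
    return counts.most_common(top_num)             # ORDER by count + LIMIT


# Hypothetical page_views records with the fields (url, timestamp, uname).
page_views = [
    {"url": "/a", "timestamp": 1, "uname": "alice"},
    {"url": "/b", "timestamp": 2, "uname": "alice"},
    {"url": "/a", "timestamp": 3, "uname": "bob"},
    {"url": "/a", "timestamp": 4, "uname": "carol"},
]

# One reusable function serves both queries, as the macro does in Pig.
top_users = top_count(page_views, "uname", 1)
top_urls = top_count(page_views, "url", 1)
```

As with the macro, the point is that the two queries differ only in the column and the limit, so the shared steps are written once.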
  10. Pig macro
      • Coming soon: piggybank with Pig macros
  11. Writing data flow program
      • Writing a complex data pipeline is an iterative process
      [Diagram: pipeline of Load, Load, Transform, Join, Group, Transform, Filter]
  12. Writing data flow program
      [Diagram: pipeline of Load, Load, Transform, Join, Group, Transform, Filter]
      No output!
  13. Writing data flow program
      • Debug!
      [Diagram: pipeline of Load, Load, Transform, Join, Group, Transform, Filter, annotated with questions]
        – Was join on wrong attributes?
        – Bug in transform?
        – Did filter drop everything?
  14. Common approaches to debug
      • Running on real (large) data
        – Inefficient, takes longer
      • Running on (small) samples
        – Empty results on join, selective filters
  15. Pig illustrate command
      • Objective: show examples of the input and output of each statement that are
        – Realistic
        – Complete
        – Concise
        – Generated fast
      • Steps
        – Downstream: sample and process
        – Prune
        – Upstream: generate realistic missing classes of examples
        – Prune
  16. Illustrate command demo
  17. Pig relation-as-scalar
      • In Pig each statement alias is a relation
        – A relation is a set of records
      • Task: Get list of pages whose load time was more than average.
      • Steps
        1. Compute average load time
        2. Get list of pages whose load time is > average
  18. Pig relation-as-scalar
      • Step 1 is like
          .. = load ..
          .. = group ..
          al_rel = foreach .. AVG(ltime) as avg_ltime;
      • Step 2 looks like
          page_views = load 'pviews.txt' as (url, ltime, ..);
          slow_views = filter page_views by ltime > avg_ltime
  19. Pig relation-as-scalar
      • Getting results of step 1 (avg_ltime)
        – Join result of step 1 with the page_views relation, or
        – Write result into file, then use a UDF to read from file
      • Pig scalar feature now simplifies this:
          slow_views = filter page_views by ltime > al_rel.avg_ltime
        – Runtime exception if al_rel has more than one record.
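The two steps and the single-record rule can be sketched in plain Python. This is a hedged illustration of the semantics described on the slides, not Pig's implementation; the helper `as_scalar`, the sample data, and the handling of an empty relation are assumptions.

```python
def as_scalar(relation, field):
    """Read `field` from a one-record relation, like `al_rel.avg_ltime`.

    Raises if the relation has more than one record, echoing the runtime
    exception mentioned on the slide. Returning None for an empty relation
    is an assumption of this sketch.
    """
    if len(relation) > 1:
        raise ValueError("scalar relation has more than one record")
    return relation[0][field] if relation else None


# Hypothetical page_views records.
page_views = [
    {"url": "/a", "ltime": 100},
    {"url": "/b", "ltime": 300},
    {"url": "/c", "ltime": 500},
]

# Step 1: a single-record relation holding the average load time.
total = sum(p["ltime"] for p in page_views)
al_rel = [{"avg_ltime": total / len(page_views)}]

# Step 2: filter against the scalar, like `ltime > al_rel.avg_ltime`.
avg = as_scalar(al_rel, "avg_ltime")
slow_views = [p for p in page_views if p["ltime"] > avg]
```

The scalar lookup replaces the join (or the file round trip) that would otherwise be needed to bring the average next to every record.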
  20. UDF in Scripting Language
      • Benefits
        – Use legacy code
        – Use libraries in the scripting language
        – Leverage Hadoop for non-Java programmers
      • Currently supported languages
        – Python (0.8)
        – JavaScript (0.8)
        – Ruby (0.10)
      • Extensible interface
        – Minimum effort to support another language
  21. Writing a Python UDF
      Write a Python UDF
          @outputSchema("word:chararray")
          def concat(word):
              return word + word

          @outputSchemaFunction("squareSchema")
          def square(num):
              if num == None:
                  return None
              return ((num)*(num))

          def squareSchema(input):
              return input
      Use it from Pig
          register 'util.py' using jython as util;
          B = foreach A generate util.square(i);
      • Invoke Python functions when needed
      • Type conversion
        – Python simple type <-> Pig simple type
        – Python Array <-> Pig Bag
        – Python Dict <-> Pig Map
        – Python Tuple <-> Pig Tuple
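When Pig loads a script through Jython it supplies the `outputSchema` and `outputSchemaFunction` decorators, so the file on the slide cannot be imported by a plain Python interpreter as written. The sketch below adds fallback decorators so the same UDFs can be unit-tested outside Pig. The fallbacks are an assumption of this sketch, not part of Pig's API; they only record the schema on the function object.

```python
# Fallback decorators for running outside Pig (assumption of this sketch).
# Under Pig the names already exist, so the except branch is skipped.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap

    def outputSchemaFunction(name):
        def wrap(func):
            func.outputSchemaFunction = name
            return func
        return wrap


@outputSchema("word:chararray")
def concat(word):
    return word + word


@outputSchemaFunction("squareSchema")
def square(num):
    if num is None:  # Pig nulls arrive as None
        return None
    return num * num


def squareSchema(input_schema):
    # The output schema is the same as the input schema.
    return input_schema
```

With this in place, `concat("ab")` and `square(3)` can be checked in an ordinary test run before the script is registered in Pig.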
  22. Use NLTK in Pig
      • Example
          register 'nltk_util.py' using jython as nltk;
          ……
          B = foreach A generate nltk.tokenize(sentence)
      nltk_util.py
          import nltk
          porter = nltk.PorterStemmer()
          @outputSchema("words:{(word:chararray)}")
          def tokenize(sentence):
              tokens = nltk.word_tokenize(sentence)
              words = [porter.stem(t) for t in tokens]
              return words
      [Diagram: "Pig eats everything" → Tokenize → Stemming → (Pig) (eat) (everything)]
  23. Comparison with Pig Streaming
      • Syntax
        – Pig Streaming: B = stream A through `perl sample.pl`;
        – Scripting UDF: B = foreach A generate myfunc.concat(a0, a1), a2;
      • Input/Output
        – Pig Streaming: stdin/stdout; entire relation
        – Scripting UDF: function parameter/return value; particular fields
      • Type conversion
        – Pig Streaming: need to parse input and convert types
        – Scripting UDF: type conversion is automatic
      • Modularize
        – Pig Streaming: every streaming operator needs a separate script
        – Scripting UDF: organize the functions into modules
  24. Writing a Script Engine
      Writing a bridge UDF
          class JythonFunction extends EvalFunc<Object> {
              public Object exec(Tuple tuple) {
                  // Convert Pig input into Python
                  PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
                  // Invoke Python UDF
                  PyObject result = f.__call__(params);
                  // Convert result to Pig
                  return JythonUtils.pythonToPig(result);
              }
              public Schema outputSchema(Schema input) {
                  PyObject outputSchemaDef = f.__findattr__("outputSchema".intern());
                  return Utils.getSchemaFromString(outputSchemaDef.toString());
              }
          }
  25. Writing a Script Engine
      Register scripting UDF
          register 'util.py' using jython as util;
      What happens in Pig
          class JythonScriptEngine extends ScriptEngine {
              public void registerFunctions(String path, String namespace, PigContext pigContext) {
                  ……
              }
          }
      [Diagram: each function defined in myudf.py is registered as a JythonFunction]
          def square(num): ……   →  square: JythonFunction("square")
          def concat(word): ……  →  concat: JythonFunction("concat")
          def count(bag): ……    →  count: JythonFunction("count")
  26. Algebraic UDF in JRuby
          class SUM < AlgebraicPigUdf
            output_schema Schema.long
            # Initial function
            def initial num
              num
            end
            # Intermediate function
            def intermed num
              num.flatten.inject(:+)
            end
            # Final function
            def final num
              intermed(num)
            end
          end
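The three functions of an algebraic UDF can be sketched in plain Python to show why SUM fits the pattern: partial sums computed on separate batches can be merged into the same answer as a single pass. This is a hedged illustration of the idea, not Pig's or JRuby's API; the batching below merely simulates map-side partial aggregation.

```python
def initial(num):
    # Called once per input value: pass it through unchanged.
    return num


def intermed(nums):
    # Combine a batch of partial results into one partial sum.
    return sum(nums)


def final(nums):
    # Combine the remaining partial sums into the answer.
    return intermed(nums)


values = [1, 2, 3, 4, 5, 6]

# Simulate two batches whose partial sums are merged at the end.
batches = [values[:3], values[3:]]
partials = [intermed([initial(v) for v in batch]) for batch in batches]
total = final(partials)
```

Because addition is associative, `total` is the same however the values are split into batches, which is what lets the intermediate function run before the final aggregation.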
  27. Pig Embedding
      • Embed Pig inside a scripting language
        – Python
        – JavaScript
      • Algorithms which cannot be completed using one Pig script
        – Iterative algorithms: PageRank, Kmeans, Neural Network, Apriori, etc.
        – Parallel independent execution: Ensemble
        – Divide and conquer
        – Branching
  28. Pig Embedding
          from org.apache.pig.scripting import Pig
          input = ":INPATH:/singlefile/studenttab10k"
          # Compile Pig script
          P = Pig.compile("""A = load '$in' as (name, age, gpa);
                             store A into 'output';""")
          # Bind variables
          Q = P.bind({'in': input})
          # Launch Pig script
          stats = Q.runSingle()
          # Iterate result
          result = stats.result("A")
          for t in result.iterator():
              print t
  29. Convergence Example
          P = Pig.compile("""DEFINE myudf MyUDF($param);
                             A = load 'input';
                             B = foreach A generate myudf(*);
                             store B into 'output';""")
          while True:
              # Bind to new parameter
              Q = P.bind({'param': new_parameter})
              results = Q.runSingle()
              iter = results.result("result").iterator()
              # Convergence check
              if converged:
                  break
              # Change parameter
              new_parameter = xxxxxx
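The control flow of the convergence loop can be tried without a cluster by replacing the Pig run with an ordinary function. In the sketch below, `run_once` stands in for binding the compiled script to the current parameter, running it, and reading back the new value. The tolerance, the iteration cap, and the Newton-step example are all assumptions added for illustration; the slide itself leaves the convergence test and the parameter update unspecified.

```python
def iterate_until_converged(run_once, param, tol=1e-6, max_iters=100):
    """Re-run `run_once` with an updated parameter until it stops changing.

    Returns the converged parameter and the number of runs it took.
    """
    for iteration in range(1, max_iters + 1):
        new_param = run_once(param)
        if abs(new_param - param) < tol:  # convergence check
            return new_param, iteration
        param = new_param                 # change parameter
    raise RuntimeError("did not converge in %d iterations" % max_iters)


# Hypothetical job: one Newton step towards the square root of 2.
def newton_step(x):
    return 0.5 * (x + 2.0 / x)


root, iterations = iterate_until_converged(newton_step, 1.0)
```

Capping the number of iterations is a design choice of this sketch: each pass of the real loop launches Pig jobs, so an unbounded `while True` can be costly if the check never succeeds.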
  30. Pig Embedding
      • Running an embedded Pig script
          pig sample.py
      • What happens within Pig?
          while True:
              Q = P.bind()
              results = Q.runSingle()
      [Diagram: sample.py is a Python script run by Jython; its while loop launches the Pig script through Pig, checks "converge?", and either loops again or ends]
  31. Nested Operator
      • Nested operator: an operator inside foreach
          B = group A by name;
          C = foreach B {
              C0 = limit A 10;
              generate flatten(C0);
          }
      • Nested operators supported prior to Pig 0.10
        – DISTINCT, FILTER, LIMIT, and ORDER BY
      • New operators added in 0.10
        – CROSS, FOREACH
  32. Nested Cross/ForEach
          C = CoGroup A, B;
          D = ForEach C {
              X = Cross A, B;
              Y = ForEach X generate CONCAT(f1, f2);
              Generate Y;
          }
      • Input: A = {(i0, a), (i0, b)}, B = {(i0, 0), (i0, 1)}
      • CoGroup A, B: C = {(i0, {a, b}, {0, 1})}
      • Cross A, B: {(i0, {(a, 0), (a, 1), (b, 0), (b, 1)})}
      • ForEach … CONCAT: {(i0, {(a0), (a1), (b0), (b1)})}
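The worked example on this slide can be reproduced in plain Python to check each intermediate result. This is a sketch of the semantics only; the helper `cogroup` is hypothetical, and the values of B are written as strings because concatenation needs text on both sides.

```python
from itertools import product

A = [("i0", "a"), ("i0", "b")]
B = [("i0", "0"), ("i0", "1")]


def cogroup(left, right):
    """Group both relations by their first field, one bag per relation."""
    keys = sorted({k for k, _ in left} | {k for k, _ in right})
    return [
        (k,
         [v for kk, v in left if kk == k],
         [v for kk, v in right if kk == k])
        for k in keys
    ]


C = cogroup(A, B)

D = []
for key, bag_a, bag_b in C:
    X = list(product(bag_a, bag_b))   # nested CROSS of the two bags
    Y = [f1 + f2 for f1, f2 in X]     # nested FOREACH ... CONCAT(f1, f2)
    D.append((key, Y))
```

The cross is computed per group, inside the loop over C, which is what distinguishes the nested form from a cross of the whole relations.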
  33. HCatalog Integration
      • HCatalog
        [Diagram: Pig, MapReduce and Hive on top of HCatalog]
      • HCatLoader/HCatStorage
        – Load/Store from HCatalog from Pig
      • HCatalog DDL Integration (Pig 0.11)
        – sql "create table student(name string, age int, gpa double);"
  34. Misc Loaders
      • HBaseStorage
        – Pig builtin
      • AvroStorage
        – Piggybank
      • CassandraStorage
        – In Cassandra code base
      • MongoStorage
        – In MongoDB code base
      • JsonLoader/JsonStorage
        – Pig builtin
  35. Talend: Enterprise Data Integration
      • Talend Open Studio for Big Data
        – Feature-rich Job Designer
        – Rich palette of pre-built templates
        – Supports HDFS, Pig, Hive, HBase, HCatalog
        – Apache-licensed, bundled with HDP
      • Key benefits
        – Graphical development
        – Robust and scalable execution
        – Broadest connectivity to support all systems: 450+ components
        – Real-time debugging
  36. Questions