Pig programming is more fun: New features in Pig
Over the last year, we have added many new language features to Pig, and Pig programming is now much easier than before. With Pig macros, we can write functions for Pig and modularize Pig programs. Pig embedding allows us to embed Pig statements in Python and use rich Python language features such as loops and branches. Java is no longer the only choice for writing Pig UDFs; we can also write UDFs in Python, JavaScript, and Ruby. Nested foreach and nested cross give us more ways to manipulate data that were not possible before. We have also added a lot of syntactic sugar to simplify Pig syntax, for example direct syntax support for map, tuple, and bag, and project-range expressions in foreach. We have also revived support for the illustrate command to ease debugging. In this paper, I will give an overview of all these features and illustrate how to use them to program more efficiently in Pig. I will also give concrete examples to demonstrate how the Pig language has evolved over time with these improvements.

Pig programming is more fun: New features in Pig Presentation Transcript

  • 1. Pig programming is more fun: New features in Pig
    Daniel Dai (@daijy), Thejas Nair (@thejasn)
    © Hortonworks Inc. 2011
  • 2. What is Apache Pig?
    – Pig Latin, a high-level data processing language.
    – An engine that executes Pig Latin locally or on a Hadoop cluster.
    Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
    Architecting the Future of Big Data
  • 3. Pig Latin example
    Query: Get the list of web pages visited by users whose age is between 20 and 29 years.
        USERS = load 'users' as (uid, age);
        USERS_20s = filter USERS by age >= 20 and age <= 29;
        PVs = load 'pages' as (url, uid, timestamp);
        PVs_u20s = join USERS_20s by uid, PVs by uid;
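To make the semantics of the slide 3 query concrete, here is a minimal plain-Python sketch of what the filter and the join compute. It is not Pig code and does not use any Pig API; the sample rows are invented purely for illustration.

```python
# Plain-Python sketch of the filter + join on slide 3 (not Pig code).
# The sample rows are invented for illustration only.
users = [("alice", 25), ("bob", 35), ("carol", 29)]      # (uid, age)
pages = [("/home", "alice", 1), ("/buy", "bob", 2),
         ("/faq", "carol", 3)]                           # (url, uid, timestamp)

# USERS_20s = filter USERS by age >= 20 and age <= 29;
users_20s = [u for u in users if 20 <= u[1] <= 29]

# PVs_u20s = join USERS_20s by uid, PVs by uid;
# An inner join keeps every (user, page view) pair whose uid matches.
pvs_u20s = [u + p for u in users_20s for p in pages if u[0] == p[1]]

# Each joined row is (uid, age, url, uid, timestamp); field 2 is the url.
urls = sorted(row[2] for row in pvs_u20s)
print(urls)
```

Pig evaluates the same two steps as distributed jobs; the list comprehensions only mirror the record-level logic.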
  • 4. Why Pig?
    • Faster development
      – Fewer lines of code
      – Don't re-invent the wheel
    • Flexible
      – Metadata is optional
      – Extensible
      – Procedural programming
    Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/
  • 5. Before Pig 0.9
    [Diagram: p1.pig, p2.pig, p3.pig]
  • 6. With Pig macros
    [Diagram: p1.pig, p2.pig, p3.pig; macro1.pig, macro2.pig]
  • 7. With Pig macros
    [Diagram: p1.pig, p1.pig; rm_bots.pig, get_top.pig]
  • 8. Pig macro example
    Page_views data: (url, timestamp, uname, …)
    Find
      1. top 5 users (uname) by page views
      2. top 10 most visited urls
  • 9. Pig macro example
    Without a macro:
        page_views = LOAD ..
        /* get top 5 users by page view */
        u_grp = GROUP .. by uname;
        u_count = FOREACH .. COUNT ..
        ord_u_count = ORDER u_count ..
        top_5_users = LIMIT ordered.. 5;
        DUMP top_5_users;
        /* get top 10 urls by page view */
        url_grp = GROUP .. by url;
        url_count = FOREACH .. COUNT ..
        ord_url_count = ORDER url_count..
        top_10_urls = LIMIT ord_url.. 10;
        DUMP top_10_urls;
    With a macro:
        /* top x macro */
        DEFINE topCount (rel, col, topNum)
        RETURNS top_num_recs {
            grped = GROUP $rel by $col;
            cnt_grp = FOREACH ..COUNT($rel)..
            ord_cnt = ORDER .. by cnt;
            $top_num_recs = LIMIT.. $topNum;
        }
        -----------------------------------------
        page_views = LOAD ..
        /* get top 5 users by page view */
        top_5_users = topCount(page_views, uname, 5);
        …
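The benefit of the topCount macro on slide 9 is the same as that of an ordinary function: the group, count, order, limit pipeline is written once and reused for both users and urls. The sketch below shows that idea in plain Python rather than Pig. The name `top_count`, the sample records, and the descending sort order are assumptions made for illustration (the slide elides the ORDER direction).

```python
from collections import Counter

def top_count(rel, col, top_num):
    """Group `rel` by the field at index `col`, count each group, and return
    the `top_num` largest (key, count) pairs, highest count first.
    Hypothetical helper that mirrors the macro's shape; not a Pig API."""
    counts = Counter(record[col] for record in rel)
    # most_common() sorts by descending count (assumed ORDER direction)
    # and truncates, which plays the role of ORDER followed by LIMIT.
    return counts.most_common(top_num)

# Invented sample records: (url, timestamp, uname)
page_views = [
    ("/a", 1, "ann"), ("/b", 2, "ann"), ("/a", 3, "bob"),
    ("/a", 4, "ann"), ("/c", 5, "cy"), ("/b", 6, "bob"),
]

top_users = top_count(page_views, 2, 2)   # reuse 1: top users by page views
top_urls = top_count(page_views, 0, 1)    # reuse 2: most visited urls
print(top_users, top_urls)
```

As with the macro, changing how "top N" is computed now means editing one definition instead of two copies of the pipeline.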
  • 10. Pig macro
    Coming soon: piggybank with Pig macros
  • 11. Writing data flow program
    Writing a complex data pipeline is an iterative process.
    [Diagram: Load, Load, Transform, Join, Group, Transform, Filter]
  • 12. Writing data flow program
    [Diagram: Load, Load, Transform, Join, Group, Transform, Filter]
    No output!
  • 13. Writing data flow program
    Debug!
    [Diagram: Load, Load, Transform, Join, Group, Transform, Filter]
    – Was join on wrong attributes?
    – Bug in transform?
    – Did filter drop everything?
  • 14. Common approaches to debug
    • Running on real (large) data
      – Inefficient, takes longer
    • Running on (small) samples
      – Empty results on join, selective filters
  • 15. Pig illustrate command
    • Objective: show examples of the input and output of each statement that are
      – Realistic
      – Complete
      – Concise
      – Generated fast
    • Steps
      – Downstream: sample and process
      – Prune
      – Upstream: generate realistic missing classes of examples
      – Prune
  • 16. Illustrate command demo
  • 17. Pig relation-as-scalar
    • In Pig, each statement alias is a relation
      – A relation is a set of records
    • Task: Get the list of pages whose load time was more than average.
    • Steps
      1. Compute average load time
      2. Get list of pages whose load time is > average
  • 18. Pig relation-as-scalar
    • Step 1 is like
        .. = load ..
        .. = group ..
        al_rel = foreach .. AVG(ltime) as avg_ltime;
    • Step 2 looks like
        page_views = load 'pviews.txt' as (url, ltime, ..);
        slow_views = filter page_views by ltime > avg_ltime
  • 19. Pig relation-as-scalar
    • Getting the result of step 1 (avg_ltime) into step 2 used to require one of:
      – Join the result of step 1 with the page_views relation, or
      – Write the result into a file, then use a UDF to read from the file
    • The Pig scalar feature now simplifies this:
        slow_views = filter page_views by ltime > al_rel.avg_ltime
      – Runtime exception if al_rel has more than one record.
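The relation-as-scalar rule from slides 17 to 19 can be pictured with a small plain-Python sketch: a one-record relation is unwrapped into a value, and using a relation with more than one record as a scalar is an error. The helper name `as_scalar` and the sample data are hypothetical, and the sketch deliberately does not model what Pig does with an empty relation.

```python
def as_scalar(relation):
    """Unwrap a one-record relation into its single record.
    Raises if the relation has more than one record, mirroring the
    runtime exception described on slide 19. Hypothetical helper;
    an empty relation is not handled in this sketch."""
    if len(relation) > 1:
        raise RuntimeError(
            "scalar has more than one row: %d records" % len(relation))
    return relation[0]

# Invented sample data: (url, ltime)
page_views = [("/a", 120), ("/b", 300), ("/c", 180)]

# Step 1: al_rel = foreach .. AVG(ltime) as avg_ltime;  (one record)
al_rel = [(sum(lt for _, lt in page_views) / float(len(page_views)),)]

# Step 2: slow_views = filter page_views by ltime > al_rel.avg_ltime
(avg_ltime,) = as_scalar(al_rel)
slow_views = [pv for pv in page_views if pv[1] > avg_ltime]
print(avg_ltime, slow_views)
```

The point of the feature is that step 2 reads like a filter against a constant, with no explicit join or side file.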
  • 20. UDF in Scripting Language
    • Benefits
      – Use legacy code
      – Use libraries in the scripting language
      – Leverage Hadoop for non-Java programmers
    • Currently supported languages
      – Python (0.8)
      – JavaScript (0.8)
      – Ruby (0.10)
    • Extensible interface
      – Minimum effort to support another language
  • 21. Writing a Python UDF
    Write a Python UDF (util.py):
        @outputSchema("word:chararray")
        def concat(word):
            return word + word

        @outputSchemaFunction("squareSchema")
        def square(num):
            if num is None:
                return None
            return num * num

        def squareSchema(input):
            return input
    Use it from Pig:
        register 'util.py' using jython as util;
        B = foreach A generate util.square(i);
    • Invoke Python functions when needed
    • Type conversion
      – Python simple type <-> Pig simple type
      – Python Array <-> Pig Bag
      – Python Dict <-> Pig Map
      – Python Tuple <-> Pig Tuple
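Because the UDFs on slide 21 are ordinary Python functions, they can be exercised outside Pig before being registered. The sketch below assumes a stand-in `outputSchema` decorator that only records the schema string on the function; when the script runs under Pig, the decorator comes from Pig's Jython runtime instead, so this stand-in and the `"num:long"` schema are illustration-only assumptions.

```python
def outputSchema(schema):
    """Stand-in for the decorator supplied by Pig's Jython runtime.
    Assumption for local testing only: it just attaches the schema
    string to the function as an attribute."""
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap

@outputSchema("word:chararray")
def concat(word):
    # Returns the word repeated twice, as on the slide.
    return word + word

@outputSchema("num:long")   # assumed fixed schema, for this sketch only
def square(num):
    # Pig nulls arrive as None, so pass them through unchanged.
    if num is None:
        return None
    return num * num

print(concat("pig"), square(4), square(None), concat.outputSchema)
```

Keeping the functions free of Pig-specific calls is what makes this kind of local check possible.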
  • 22. Use NLTK in Pig
    Example:
        register 'nltk_util.py' using jython as nltk;
        ……
        B = foreach A generate nltk.tokenize(sentence)
    nltk_util.py:
        import nltk
        porter = nltk.PorterStemmer()

        @outputSchema("words:{(word:chararray)}")
        def tokenize(sentence):
            tokens = nltk.word_tokenize(sentence)
            words = [porter.stem(t) for t in tokens]
            return words
    [Diagram: "Pig eats everything" -> Tokenize -> Stemming -> (Pig) (eat) (everything)]
  • 23. Comparison with Pig Streaming
    – Syntax
        Pig Streaming: B = stream A through `perl sample.pl`;
        Scripting UDF: B = foreach A generate myfunc.concat(a0, a1), a2;
    – Input/Output
        Pig Streaming: stdin/stdout; entire relation
        Scripting UDF: function parameter/return value; particular fields
    – Type Conversion
        Pig Streaming: need to parse input/convert type
        Scripting UDF: type conversion is automatic
    – Modularize
        Pig Streaming: every streaming operator needs a separate script
        Scripting UDF: organize the functions into a module
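The input/output and type-conversion rows of the slide 23 comparison can be illustrated with two versions of the same concat logic in plain Python. The UDF-style function receives only the fields it needs; the streaming-style function sees whole records as text and has to split and re-join them itself. The function names, the tab-delimited record format, and the sample input are assumptions for this sketch.

```python
import io

def concat(a0, a1):
    """UDF style: called with just the fields it needs, already typed."""
    return a0 + a1

def stream_concat(stdin, stdout):
    """Streaming style: reads every record of the relation as a text line
    (assumed tab-delimited here), parses the fields by hand, and writes
    text back out."""
    for line in stdin:
        a0, a1, a2 = line.rstrip("\n").split("\t")
        stdout.write(u"%s\t%s\n" % (a0 + a1, a2))

# In-memory stand-ins for stdin/stdout with two invented records.
src = io.StringIO(u"ab\tcd\t1\nx\ty\t2\n")
out = io.StringIO()
stream_concat(src, out)
print(out.getvalue())
```

The parsing and formatting lines in `stream_concat` are exactly the work that a scripting UDF leaves to Pig.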
  • 24. Writing a Script Engine
    Writing a bridge UDF:
        class JythonFunction extends EvalFunc<Object> {
            public Object exec(Tuple tuple) {
                // Convert Pig input into Python
                PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
                // Invoke Python UDF
                PyObject result = f.__call__(params);
                // Convert result to Pig
                return JythonUtils.pythonToPig(result);
            }
            public Schema outputSchema(Schema input) {
                PyObject outputSchemaDef = f.__findattr__("outputSchema".intern());
                return Utils.getSchemaFromString(outputSchemaDef.toString());
            }
        }
  • 25. Writing a Script Engine
    Register scripting UDF:
        register 'util.py' using jython as util;
    What happens in Pig:
        class JythonScriptEngine extends ScriptEngine {
            public void registerFunctions(String path, String namespace, PigContext pigContext) {
                ……
            }
        }
    Each function in myudf.py is registered as a JythonFunction:
        def square(num): ……    ->  square  JythonFunction("square")
        def concat(word): ……   ->  concat  JythonFunction("concat")
        def count(bag): ……     ->  count   JythonFunction("count")
  • 26. Algebraic UDF in JRuby
        class SUM < AlgebraicPigUdf
            output_schema Schema.long
            # Initial function
            def initial num
                num
            end
            # Intermediate function
            def intermed num
                num.flatten.inject(:+)
            end
            # Final function
            def final num
                intermed(num)
            end
        end
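The three functions of the algebraic SUM on slide 26 can be traced by hand with a plain-Python sketch: the initial function is applied to each value, the intermediate function combines partial results, and the final function produces the answer. The split of the input into three chunks is an invented stand-in for how partial aggregation might be distributed; it is not how Pig schedules work.

```python
def initial(num):
    # Applied to each input value; for SUM the value is passed through.
    return num

def intermed(nums):
    # Combines a batch of partial results into one partial result.
    return sum(nums)

def final(nums):
    # Combines the remaining partial results into the final answer.
    return intermed(nums)

values = [1, 2, 3, 4, 5, 6]
# Invented partitioning: pretend each chunk is aggregated separately.
chunks = [values[0:2], values[2:4], values[4:6]]

partials = [intermed([initial(v) for v in chunk]) for chunk in chunks]
total = final(partials)
print(partials, total)
```

The result equals a single pass over all values, which is the property that lets an algebraic function be computed in stages.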
  • 27. Pig Embedding
    • Embed Pig inside a scripting language
      – Python
      – JavaScript
    • Algorithms which cannot be completed using one Pig script
      – Iterative algorithms: PageRank, K-means, Neural Network, Apriori, etc.
      – Parallel independent execution: Ensemble
      – Divide and conquer
      – Branching
  • 28. Pig Embedding
        from org.apache.pig.scripting import Pig
        input = ":INPATH:/singlefile/studenttab10k"
        # Compile Pig script
        P = Pig.compile("""A = load '$in' as (name, age, gpa);
                           store A into 'output';""")
        # Bind variables
        Q = P.bind({'in': input})
        # Launch Pig script
        stats = Q.runSingle()
        # Iterate result
        result = stats.result("A")
        for t in result.iterator():
            print t
  • 29. Convergence Example
        P = Pig.compile("""DEFINE myudf MyUDF('$param');
                           A = load 'input';
                           B = foreach A generate myudf(*);
                           store B into 'output';""")
        while True:
            # Bind to new parameter
            Q = P.bind({'param': new_parameter})
            results = Q.runSingle()
            iter = results.result("result").iterator()
            # Convergence check
            if converged:
                break
            # Change parameter
            new_parameter = xxxxxx
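The control flow of the convergence example on slide 29 (bind a parameter, run, check, update, repeat) can be tried without a cluster by swapping the Pig run for a local function. In the sketch below, `run_step` is a hypothetical stand-in for one bound Pig run; it performs a single Newton step toward the square root of 2 so that the loop has something real to converge on. The tolerance and the iteration cap are also assumptions.

```python
def run_step(param):
    """Hypothetical stand-in for Q = P.bind(...); Q.runSingle().
    Here one 'run' is one Newton step toward sqrt(2)."""
    return 0.5 * (param + 2.0 / param)

def iterate(param, tol=1e-9, max_iters=50):
    """Repeat run_step until two successive results differ by < tol.
    Returns (result, number_of_runs)."""
    for i in range(max_iters):
        result = run_step(param)           # bind to new parameter and run
        if abs(result - param) < tol:      # convergence check
            return result, i + 1
        param = result                     # change parameter
    # Unlike a bare `while True`, give up instead of looping forever.
    raise RuntimeError("did not converge in %d runs" % max_iters)

value, runs = iterate(1.0)
print(value, runs)
```

Capping the number of iterations is a design choice worth carrying over to real embedded scripts, since every pass launches a full Pig job.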
  • 30. Pig Embedding
    • Running an embedded Pig script
        pig sample.py
    • What happens within Pig?
        while True:
            Q = P.bind()
            results = Q.runSingle()
            converge?
    [Diagram: sample.py, Python script, Jython, Pig script, Pig, while loop, End]
  • 31. Nested Operator
    • Nested operator: an operator inside foreach
        B = group A by name;
        C = foreach B {
            C0 = limit A 10;
            generate flatten(C0);
        }
    • Nested operators supported prior to Pig 0.10
      – DISTINCT, FILTER, LIMIT, and ORDER BY
    • New operators added in 0.10
      – CROSS, FOREACH
  • 32. Nested Cross/ForEach
    Input relations
        A = {(i0, a), (i0, b)}
        B = {(i0, 0), (i0, 1)}
    CoGroup A, B
        C = {(i0, {a, b}, {0, 1})}
    Cross A, B
        {(i0, {(a, 0), (a, 1), (b, 0), (b, 1)})}
    ForEach … CONCAT
        {(i0, {(a0), (a1), (b0), (b1)})}
    Script
        C = CoGroup A, B;
        D = ForEach C {
            X = Cross A, B;
            Y = ForEach X generate CONCAT(f1, f2);
            Generate Y;
        }
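The data flow on slide 32 (cogroup, then a cross and a foreach inside each group) can be reproduced on the slide's own sample values with a few lines of plain Python. This is only a sketch of the semantics: the `cogroup` helper is hypothetical, bags are modeled as lists, and string concatenation stands in for CONCAT.

```python
from itertools import product

# The two input relations from the slide, as (key, value) pairs.
A = [("i0", "a"), ("i0", "b")]
B = [("i0", 0), ("i0", 1)]

def cogroup(left, right):
    """Hypothetical helper: for each key, collect the values from both
    inputs into two separate bags (lists here)."""
    keys = sorted(set(k for k, _ in left) | set(k for k, _ in right))
    return [(k,
             [v for kk, v in left if kk == k],
             [v for kk, v in right if kk == k]) for k in keys]

# C = CoGroup A, B;
C = cogroup(A, B)

# D = ForEach C { X = Cross A, B; Y = ForEach X generate CONCAT(f1, f2); }
# product() is the per-group cross; str(x) + str(y) plays the role of CONCAT.
D = [(key, [str(x) + str(y) for x, y in product(bag_a, bag_b)])
     for key, bag_a, bag_b in C]
print(C)
print(D)
```

The cross happens inside each group, so pairs are only formed between values that share a key.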
  • 33. HCatalog Integration
    • HCatalog
      [Diagram: Pig, MapReduce, Hive, HCatalog]
    • HCatLoader/HCatStorer
      – Load from and store to HCatalog from Pig
    • HCatalog DDL integration (Pig 0.11)
      – sql "create table student(name string, age int, gpa double);"
  • 34. Misc Loaders
    • HBaseStorage
      – Pig builtin
    • AvroStorage
      – Piggybank
    • CassandraStorage
      – In Cassandra code base
    • MongoStorage
      – In MongoDB code base
    • JsonLoader/JsonStorage
      – Pig builtin
  • 35. Talend Enterprise Data Integration
    • Talend Open Studio for Big Data
      – Feature-rich Job Designer
      – Rich palette of pre-built templates
      – Supports HDFS, Pig, Hive, HBase, HCatalog
      – Apache-licensed, bundled with HDP
    • Key benefits
      – Graphical development
      – Robust and scalable execution
      – Broadest connectivity to support all systems: 450+ components
      – Real-time debugging
  • 36. Questions