Pig programming is fun
Transcript

  • 1. Pig programming is more fun: New features in Pig
    Daniel Dai (@daijy), Thejas Nair (@thejasn)
    © Hortonworks Inc. 2011
  • 2. What is Apache Pig?
    • Pig Latin, a high-level data processing language.
    • An engine that executes Pig Latin locally or on a Hadoop cluster.
    Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
    Architecting the Future of Big Data
  • 3. Pig Latin example
    • Query: get the list of pages visited by users whose age is between 20 and 25 years.
        users = load 'users' as (name, age);
        users_20_to_25 = filter users by age > 20 and age <= 25;
        page_views = load 'pages' as (user, url);
        page_views_u20_to_25 = join users_20_to_25 by name, page_views by user;
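To make the filter and join semantics of this query concrete, here is a minimal in-memory Python sketch. It is not how Pig executes the script; the sample records and the helper names `filter_by_age` and `join_by_name` are hypothetical, chosen only for illustration.

```python
# Hypothetical in-memory stand-ins for the two loaded inputs.
users = [("alice", 22), ("bob", 31), ("carol", 25), ("dave", 20)]
page_views = [("alice", "/home"), ("bob", "/home"),
              ("carol", "/cart"), ("alice", "/cart")]


def filter_by_age(users, low, high):
    """Mirror `filter users by age > low and age <= high`."""
    return [(name, age) for name, age in users if low < age <= high]


def join_by_name(users, page_views):
    """Mirror an inner `join users by name, page_views by user`.

    Like a Pig join, the output keeps the fields of both inputs.
    """
    urls_by_user = {}
    for user, url in page_views:
        urls_by_user.setdefault(user, []).append(url)
    return [(name, age, name, url)
            for name, age in users
            for url in urls_by_user.get(name, [])]


selected_users = filter_by_age(users, 20, 25)
joined = join_by_name(selected_users, page_views)
```

Note that the filter as written on the slide uses a strict lower bound, so a user aged exactly 20 is excluded.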
  • 4. Why Pig?
    • Faster development
      – Fewer lines of code
      – Don't re-invent the wheel
    • Flexible
      – Metadata is optional
      – Extensible
      – Procedural programming
    Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/
  • 5. Before Pig 0.9 (diagram: p1.pig, p2.pig, p3.pig)
  • 6. With Pig macros (diagram: p1.pig, p2.pig, p3.pig, macro1.pig, macro2.pig)
  • 7. With Pig macros (diagram: p1.pig, p1.pig, rm_bots.pig, get_top.pig)
  • 8. Pig macro example
    • Page_views data: (user_name, url, timestamp, …)
    • Find top 5 users by page views
    • Find top 10 most visited pages
  • 9. Pig macro example
    Without a macro:
        page_views = LOAD ..
        /* get top 5 users by page view */
        u_grp = GROUP .. by uname;
        u_count = FOREACH .. COUNT ..
        ord_u_count = ORDER u_count ..
        top_5_users = LIMIT ordered.. 5;
        DUMP top_5_users;
        /* get top 10 urls by page view */
        url_grp = GROUP .. by url;
        url_count = FOREACH .. COUNT .
        ord_url_count = ORDER url_count..
        top_10_urls = LIMIT ord_url.. 10;
        DUMP top_10_urls;
    With a macro:
        /* top x macro */
        DEFINE topCount (rel, col, topNum)
        RETURNS top_num_recs {
            grped = GROUP $rel by $col;
            cnt_grp = FOREACH ..COUNT($rel)..
            ord_cnt = ORDER .. by cnt;
            $top_num_recs = LIMIT.. $topNum;
        }
        -----------------------------------------
        page_views = LOAD ..
        /* get top 5 users by page view */
        top_5_users = topCount(page_views, uname, 5);
        DUMP top_5_users;
        …
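The group, count, order, limit pattern that the macro factors out can be sketched as an ordinary Python function. This is only an analogy for what the macro parameterizes (relation, column, and N); the name `top_count`, the dictionary records, and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def top_count(records, col, top_num):
    """Group records by `col`, count each group, order by count
    descending, and keep the first `top_num` (key, count) pairs."""
    counts = Counter(rec[col] for rec in records)
    # Ties are broken by key so the result is deterministic (an assumption;
    # the slide does not say how ties are ordered).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:top_num]


# Hypothetical sample of page view records.
page_views = [
    {"uname": "alice", "url": "/home"},
    {"uname": "bob", "url": "/home"},
    {"uname": "alice", "url": "/cart"},
    {"uname": "carol", "url": "/home"},
    {"uname": "alice", "url": "/home"},
    {"uname": "bob", "url": "/cart"},
]

top_2_users = top_count(page_views, "uname", 2)
top_1_url = top_count(page_views, "url", 1)
```

As with the macro, the same definition serves both the per-user and the per-URL question; only the column and the limit change.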
  • 10. Pig macro
    • Coming soon: piggybank with Pig macros
  • 11. Writing data flow program
    • Writing a complex data pipeline is an iterative process
      (diagram: Load, Load, Transform, Join, Group, Transform, Filter)
  • 12. Writing data flow program
    (diagram: Load, Load, Transform, Join, Group, Transform, Filter)
    No output! :(
  • 13. Writing data flow program
    • Debug! (diagram: Load, Load, Transform, Join, Group, Transform, Filter)
      – Was join on wrong attributes?
      – Bug in transform?
      – Did filter drop everything?
  • 14. Common approaches to debug
    • Running on real (large) data
      – Inefficient, takes longer
    • Running on (small) samples
      – Empty results on join, selective filters
  • 15. Pig illustrate command
    • Objective: show examples for the input/output of each statement that are
      – Realistic
      – Complete
      – Concise
      – Generated fast
    • Steps
      – Downstream: sample and process
      – Prune
      – Upstream: generate realistic missing classes of examples
      – Prune
  • 16. Illustrate command demo
  • 17. Pig relation-as-scalar
    • In Pig each statement alias is a relation
      – Relation is a set of records
    • Task: Get list of pages whose load time was more than average.
    • Steps
      1. Compute average load time
      2. Get list of pages whose load time is > average
  • 18. Pig relation-as-scalar
    • Step 1 is like
        .. = load ..
        .. = group ..
        al_rel = foreach .. AVG(ltime) as avg_ltime;
    • Step 2 looks like
        page_views = load 'pviews.txt' as (url, ltime, ..);
        slow_views = filter page_views by ltime > avg_ltime
  • 19. Pig relation-as-scalar
    • Getting results of step 1 (avg_ltime)
      – Join result of step 1 with page_views relation, or
      – Write result into file, then use UDF to read from file
    • Pig scalar feature now simplifies this:
        slow_views = filter page_views by ltime > al_rel.avg_ltime
      – Runtime exception if al_rel has more than one record.
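The relation-as-scalar idea can be illustrated in plain Python: a one-record relation is dereferenced as if it were a single value, and more than one record is an error. This is a sketch of the semantics only, not Pig's implementation. The helper `as_scalar`, the sample load times, and the choice to return `None` for an empty relation are assumptions.

```python
def as_scalar(relation, field):
    """Read `field` from a relation that should hold exactly one record.

    Raises if the relation has more than one record, mirroring the runtime
    exception mentioned on the slide. Returning None for an empty relation
    is an assumption of this sketch.
    """
    if len(relation) > 1:
        raise RuntimeError("relation used as scalar has more than one record")
    return relation[0][field] if relation else None


# Hypothetical (url, ltime) records.
page_views = [("/home", 120.0), ("/cart", 480.0), ("/search", 300.0)]

# Step 1: a one-record relation holding the average load time.
al_rel = [{"avg_ltime": sum(ltime for _, ltime in page_views) / len(page_views)}]

# Step 2: use that relation as a scalar inside the filter condition.
avg_ltime = as_scalar(al_rel, "avg_ltime")
slow_views = [(url, ltime) for url, ltime in page_views if ltime > avg_ltime]
```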
  • 20. UDF in Scripting Language
    • Benefits
      – Use legacy code
      – Use libraries of the scripting language
      – Leverage Hadoop for non-Java programmers
    • Currently supported languages
      – Python
      – JavaScript
      – Ruby
    • Extensible interface
      – Minimum effort to support another language
  • 21. Writing a Jython UDF
    Write a Jython UDF
        @outputSchema("word:chararray")
        def concat(word):
            return word + word

        @outputSchemaFunction("squareSchema")
        def square(num):
            if num is None:
                return None
            return ((num)*(num))

        def squareSchema(input):
            return input
    • Invoke Jython UDF when needed
        register 'util.py' using jython as util;
        B = foreach A generate util.square(i);
    • Type conversion
      – Simple type
      – Python Array <-> Pig Bag
      – Python Dict <-> Pig Map
      – Python Tuple <-> Pig Tuple
    • Convey schema to Pig
      – outputSchema
      – outputSchemaFunction
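Because a UDF like this is ordinary Python, its logic can be exercised outside Pig before it is registered. The sketch below assumes the `outputSchema` decorator is supplied by Pig at registration time and defines a hypothetical stand-in when it is missing, so the same file imports under a plain interpreter. The fallback decorator and the `n:long` schema string are illustrative assumptions, not part of the slide.

```python
try:
    outputSchema  # expected to be provided by Pig when registered `using jython`
except NameError:
    # Hypothetical stand-in so the module also loads under plain Python.
    def outputSchema(schema):
        def wrap(func):
            # Record the schema string on the function for inspection.
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("word:chararray")
def concat(word):
    # Return the input string repeated twice.
    return word + word


@outputSchema("n:long")
def square(num):
    # Propagate nulls instead of failing on them.
    if num is None:
        return None
    return num * num
```

Running quick checks such as `square(None)` locally catches null-handling mistakes without launching a job.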
  • 22. Use NLTK in Pig
    • Example
        register 'nltk_util.py' using jython as nltk;
        ……
        B = foreach A generate nltk.tokenize(sentence);
      nltk_util.py
        import nltk
        porter = nltk.PorterStemmer()
        @outputSchema("words:{(word:chararray)}")
        def tokenize(sentence):
            tokens = nltk.word_tokenize(sentence)
            words = [porter.stem(t) for t in tokens]
            return words
  • 23. Writing a Script Engine
    Writing a bridge UDF
        class JythonFunction extends EvalFunc<Object> {
            // Convert the Pig tuple to Python objects, call the function, convert back.
            public Object exec(Tuple tuple) {
                PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
                PyObject result = function.__call__(params);
                return JythonUtils.pythonToPig(result);
            }
            // Read the schema string that the outputSchema decorator attached to the function.
            public Schema outputSchema(Schema input) {
                PyObject outputSchemaDef = function.__findattr__("outputSchema".intern());
                return Utils.getSchemaFromString(outputSchemaDef.toString());
            }
        }
  • 24. Writing a Script Engine
    Register scripting UDF
        register 'util.py' using jython as util;
    What happens in Pig
        class JythonScriptEngine extends ScriptEngine {
            public void registerFunctions(String path, String namespace, PigContext pigContext) {
                PythonInterpreter pi = Interpreter.interpreter;
                pi.execfile(path);
                // Register every name defined by the script as a Pig function.
                for (PyTuple item : pi.getLocals().items()) {
                    String key = item.get(0).toString();  // function name
                    FuncSpec funcspec = new FuncSpec(JythonFunction.class.getCanonicalName() + "(" + path + "," + key + ")");
                    pigContext.registerFunction(namespace + key, funcspec);
                }
            }
        }
  • 25. Algebraic UDF in JRuby
        class Count < AlgebraicPigUdf
            output_schema Schema.long
            def initial t
                t.nil? ? 0 : 1
            end
            def intermed t
                return 0 if t.nil?
                t.flatten.inject(:+)
            end
            def final t
                intermed(t)
            end
        end
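The three-phase shape of an algebraic function (initial, intermediate, final) is what lets partial results be combined before the final step. The Python sketch below walks a count through those phases over hypothetical chunks of input. It only illustrates the contract; the chunking and function names are assumptions, not the JRuby API.

```python
def initial(record):
    """Map one input record to a partial count (nulls count as 0)."""
    return 0 if record is None else 1


def intermed(partials):
    """Combine partial counts into one partial count."""
    return sum(partials)


def final(partials):
    """Combine the remaining partial counts into the result."""
    return intermed(partials)


# Hypothetical input, split into chunks as if processed by separate tasks.
records = ["a", None, "b", "c", None, "d"]
chunks = [records[0:2], records[2:5], records[5:6]]

per_chunk = [intermed([initial(r) for r in chunk]) for chunk in chunks]
total = final(per_chunk)
```

Because `intermed` is just a sum, it can be applied any number of times to partial results without changing the final count.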
  • 26. Pig Embedding
    • Embed Pig inside scripting language
      – Python
      – JavaScript
    • Algorithms which cannot complete using one Pig script
      – Iterative algorithms: PageRank, K-means, Neural Network, Apriori, etc.
      – Parallel execution: Random forest
      – Divide and conquer
      – Branching
  • 27. Pig Embedding
        from org.apache.pig.scripting import Pig
        input = ":INPATH:/singlefile/studenttab10k"
        # Compile Pig script
        P = Pig.compile("""A = load '$in' as (name, age, gpa); store A into 'output';""")
        # Bind variables
        Q = P.bind({'in': input})
        # Launch Pig script
        result = Q.runSingle()
        if result.isSuccessful():
            print "Pig job PASSED"
        else:
            raise RuntimeError("Pig job FAILED")
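For the iterative algorithms that motivate embedding, the host script typically wraps a bind-and-run step in a loop and stops on convergence. The sketch below shows only that control flow in plain Python. The `step` callable is a hypothetical stand-in for "bind parameters, run the Pig script, read back the result", and the numeric update used in the example is arbitrary.

```python
def run_until_converged(step, state, tolerance, max_iterations):
    """Call `step` repeatedly until the state changes by less than `tolerance`.

    Returns the final state and the number of iterations performed.
    """
    for iteration in range(1, max_iterations + 1):
        new_state = step(state)
        if abs(new_state - state) < tolerance:
            return new_state, iteration
        state = new_state
    return state, max_iterations


def example_step(x):
    # Stand-in for one job run: a simple update that converges to sqrt(2).
    return (x + 2.0 / x) / 2.0


value, iterations = run_until_converged(example_step, 1.0, 1e-6, 50)
```

In an embedded script the loop body would launch a job per iteration, so a `max_iterations` guard matters more than it does here.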
  • 28. Pig Embedding
    • Running an embedded Pig script
        pig sample.py
    • What happens within Pig?
      (diagram labels: sample.py, Python Script, Pig, Jython, Pig Script, Pig)
  • 29. Nested Operator
    • Nested operator: operator inside foreach
        B = group A by name;
        C = foreach B {
            C0 = limit A 10;
            generate C0;
        }
    • Prior to Pig 0.10, supported nested operators
      – DISTINCT, FILTER, LIMIT, and ORDER BY
    • New operators added in 0.10
      – CROSS, FOREACH
  • 30. Nested Cross/Foreach
        A = LOAD 'studenttab10k' as (name:chararray, age:int, gpa:double);
        B = LOAD 'votertab10k' as (name:chararray, age:int, registration, contributions:double);
        C = cogroup A by name, B by name;
        D = foreach C {
            C1 = filter A by gpa > 4;
            C2 = filter B by contributions > 500;
            C3 = cross C1, C2;
            C4 = foreach C3 generate CONCAT(CONCAT((chararray)gpa, '_'), (chararray)contributions);
            generate flatten(C4);
        }
        store D into 'output';
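What the nested block computes per group can be sketched in Python: cogroup both inputs by name, filter each bag, cross the filtered bags, and emit one string per pair. The sample records and the helper names are hypothetical, and Python's float formatting stands in for the `(chararray)` casts, so the exact strings Pig produces may differ.

```python
from itertools import product

# Hypothetical (name, age, gpa) and (name, age, registration, contributions) records.
students = [("amy", 19, 4.5), ("amy", 20, 3.9), ("bo", 22, 4.2)]
voters = [("amy", 19, "dem", 900.0), ("amy", 20, "rep", 100.0), ("bo", 22, "ind", 50.0)]


def cogroup_by_name(left, right):
    """Build {name: (bag_of_left_records, bag_of_right_records)}."""
    groups = {}
    for rec in left:
        groups.setdefault(rec[0], ([], []))[0].append(rec)
    for rec in right:
        groups.setdefault(rec[0], ([], []))[1].append(rec)
    return groups


def nested_cross(groups, min_gpa, min_contrib):
    """Per group: filter both bags, cross them, and flatten the results."""
    out = []
    for name in sorted(groups):
        a_bag, b_bag = groups[name]
        c1 = [a for a in a_bag if a[2] > min_gpa]
        c2 = [b for b in b_bag if b[3] > min_contrib]
        for a, b in product(c1, c2):
            out.append("%s_%s" % (a[2], b[3]))
    return out


result = nested_cross(cogroup_by_name(students, voters), 4, 500)
```

A group whose filtered bag is empty on either side contributes nothing, which is why only one pair survives in this sample.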
  • 31. Misc Loaders
    • HBaseStorage
    • CassandraStorage
    • AvroStorage
    • JsonLoader/JsonStorage
  • 32. New operators to come
    • Will be available in Pig 0.11
      – RANK: a distributed RANK implementation for Pig
      – CUBE