Pig programming is more fun: New features in Pig



Daniel Dai (@daijy)
Thejas Nair (@thejasn)




© Hortonworks Inc. 2011
What is Apache Pig?
• Pig Latin, a high-level data processing language
• An engine that executes Pig Latin locally or on a Hadoop cluster




Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

Pig-latin example
• Query : Get the list of pages visited by users whose age is
  between 20 and 25 years.

users = load 'users' as (name, age);

users_20_to_25 = filter users by age > 20 and age <= 25;

page_views = load 'pages' as (user, url);

page_views_20_to_25 = join users_20_to_25 by name, page_views by user;
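
The slide stops at the join; to actually get the list of pages, one more projection and store would typically follow. This last step is not on the slide, so the following is only an illustrative completion (output path is made up):

pages_visited = foreach page_views_20_to_25 generate page_views::url;
store pages_visited into 'output';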

Why Pig?
• Faster development
  –  Fewer lines of code
  –  Don’t re-invent the wheel

• Flexible
  –  Metadata is optional
  –  Extensible
  –  Procedural programming



         Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

Before Pig 0.9
(Diagram: three separate scripts, p1.pig, p2.pig, and p3.pig, each written independently.)




With Pig macros
(Diagram: p1.pig, p2.pig, and p3.pig sharing code through macro1.pig and macro2.pig.)




With Pig macros
(Diagram: p1.pig before and after refactoring its common steps into the rm_bots.pig and get_top.pig macros.)




Pig macro example
• Page_views data : (user_name, url, timestamp, …)
• Find top 5 users by page views
• Find top 10 most visited pages.




Pig Macro example

Without the macro:

page_views = LOAD ..
/* get top 5 users by page view */
u_grp = GROUP .. by uname;
u_count = FOREACH .. COUNT ..
ord_u_count = ORDER u_count ..
top_5_users = LIMIT ordered.. 5;
DUMP top_5_users;

/* get top 10 urls by page view */
url_grp = GROUP .. by url;
url_count = FOREACH .. COUNT ..
ord_url_count = ORDER url_count..
top_10_urls = LIMIT ord_url.. 10;
DUMP top_10_urls;

With the macro:

/* top x macro */
DEFINE topCount (rel, col, topNum)
RETURNS top_num_recs {
  grped = GROUP $rel by $col;
  cnt_grp = FOREACH ..COUNT($rel)..
  ord_cnt = ORDER .. by cnt;
  $top_num_recs = LIMIT.. $topNum;
}

page_views = LOAD ..
/* get top 5 users by page view */
top_5_users = topCount(page_views, uname, 5);
DUMP top_5_users;
…
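
The slide elides most of the statements; a fully written-out version of the macro might look like the sketch below (field names, types, and the input path are illustrative, not from the slide):

/* sketch of a complete topCount macro */
DEFINE topCount(rel, col, topNum) RETURNS top_num_recs {
    grped         = GROUP $rel BY $col;
    cnt_grp       = FOREACH grped GENERATE group AS $col, COUNT($rel) AS cnt;
    ord_cnt       = ORDER cnt_grp BY cnt DESC;
    $top_num_recs = LIMIT ord_cnt $topNum;
};

page_views  = LOAD 'page_views' AS (uname:chararray, url:chararray, timestamp:long);
top_5_users = topCount(page_views, uname, 5);
top_10_urls = topCount(page_views, url, 10);
DUMP top_5_users;
DUMP top_10_urls;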


Pig macro
• Coming soon – piggybank with pig macros




Writing a data flow program
• Writing a complex data pipeline is an iterative process

(Diagram: an example pipeline: two datasets are loaded, one is transformed, they are joined, and the result is grouped, transformed, and filtered.)

Writing a data flow program

(Diagram: the same pipeline as before, but now the final filter step produces no output.)




Writing a data flow program
• Debug!

(Diagram: the same pipeline annotated with debugging questions: was the join on the wrong attributes? Is there a bug in the transform? Did the filter drop everything?)

Common approaches to debug
• Running on real (large) data
  – Inefficient, takes longer
• Running on (small) samples
  – Joins and selective filters often produce empty results




Pig illustrate command
• Objective: show example input/output records for each statement that are
  – Realistic
  – Complete
  – Concise
  – Generated fast
• Steps
  – Downstream – sample and process
  – Prune
  – Upstream – generate realistic missing classes of examples
  – Prune


Illustrate command demo
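
The live demo is not captured in the slides; a minimal sketch of invoking the command from the Grunt shell looks like this (the script, file, and field names are made up):

grunt> page_views = LOAD 'pviews.txt' AS (user:chararray, url:chararray, ltime:long);
grunt> slow_views = FILTER page_views BY ltime > 1000;
grunt> ILLUSTRATE slow_views;

ILLUSTRATE prints a small, realistic set of example records for each intermediate relation leading up to slow_views.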




Pig relation-as-scalar
• In Pig, each statement alias is a relation
   – Relation is a set of records
• Task: Get list of pages whose load time was more
  than average.
• Steps
   1.  Compute average load time
   2.  Get list of pages whose load time is > average




Pig relation-as-scalar
• Step 1 is like
  .. = load ..
  .. = group ..
  al_rel = foreach .. AVG(ltime) as avg_ltime;


• Step 2 looks like
   page_views = load 'pviews.txt' as (url, ltime, ..);

   slow_views = filter page_views by
                  ltime > avg_ltime




Pig relation-as-scalar
• Getting the result of step 1 (avg_ltime)
   – Join the result of step 1 with the page_views relation, or
   – Write the result into a file, then use a UDF to read it from the file
• The Pig scalar feature now simplifies this:
   slow_views = filter page_views by
                  ltime > al_rel.avg_ltime


   – Runtime exception if al_rel has more than one record.
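
Putting the two steps together, an end-to-end version might look like the following sketch (the file name and schema are illustrative):

page_views = LOAD 'pviews.txt' AS (url:chararray, ltime:long);

-- step 1: compute the average load time (a single-record relation)
grp_all = GROUP page_views ALL;
al_rel  = FOREACH grp_all GENERATE AVG(page_views.ltime) AS avg_ltime;

-- step 2: use the single-record relation as a scalar in the filter
slow_views = FILTER page_views BY ltime > al_rel.avg_ltime;
DUMP slow_views;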




UDF in Scripting Language
• Benefit
   – Use legacy code
   – Use libraries written in the scripting language
   – Leverage Hadoop for non-Java programmers
• Currently supported languages
   – Python
   – JavaScript
   – Ruby
• Extensible Interface
   – Minimum effort to support another language



Writing a Jython UDF

Write a Jython UDF:

@outputSchema("word:chararray")
def concat(word):
    return word + word

@outputSchemaFunction("squareSchema")
def square(num):
    if num == None:
        return None
    return num * num

def squareSchema(input):
    return input

Invoke the Jython UDF when needed:

register 'util.py' using jython as util;
B = foreach A generate util.square(i);

• Type conversion
   – Simple types
   – Python Array <-> Pig Bag
   – Python Dict <-> Pig Map
   – Python Tuple <-> Pig Tuple
• Convey schema to Pig
   – outputSchema
   – outputSchemaFunction

Use NLTK in Pig
• Example
   register 'nltk_util.py' using jython as nltk;
   ……
   B = foreach A generate nltk.tokenize(sentence);

 nltk_util.py
   import nltk
   porter = nltk.PorterStemmer()
   @outputSchema("words:{(word:chararray)}")
   def tokenize(sentence):
     tokens = nltk.word_tokenize(sentence)
     words = [porter.stem(t) for t in tokens]
     return words



Writing a Script Engine
Writing a bridge UDF
class JythonFunction extends EvalFunc<Object> {
   public Object exec(Tuple tuple) {
     PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
     PyObject result = function.__call__(params);
     return JythonUtils.pythonToPig(result);
   }
   public Schema outputSchema(Schema input) {
     PyObject outputSchemaDef = function.__findattr__("outputSchema".intern());
     return Utils.getSchemaFromString(outputSchemaDef.toString());
   }
}




Writing a Script Engine
Register scripting UDF

register 'util.py' using jython as util;

What happens in Pig
class JythonScriptEngine extends ScriptEngine {
   public void registerFunctions(String path, String namespace,
                                 PigContext pigContext) {
     PythonInterpreter pi = Interpreter.interpreter;
     pi.execfile(path);
     for (PyTuple item : pi.getLocals().items()) {
        String key = item.get(0).toString();
        FuncSpec funcspec = new FuncSpec(JythonFunction.class.getCanonicalName()
                   + "('" + path + "','" + key + "')");
        pigContext.registerFunction(namespace + key, funcspec);
     }
   }
}



Algebraic UDF in JRuby
class Count < AlgebraicPigUdf
   output_schema Schema.long

  def initial t
    t.nil? ? 0 : 1
  end

  def intermed t
    return 0 if t.nil?
    t.flatten.inject(:+)
  end

  def final t
    intermed(t)
  end

end


Pig Embedding
• Embed Pig inside scripting language
  – Python
  – JavaScript
• Algorithms that cannot be expressed in a single Pig script
  – Iterative algorithms
    PageRank, K-means, neural networks, Apriori, etc.
  – Parallel execution
    Random forest
  – Divide and Conquer
  – Branching




Pig Embedding
from org.apache.pig.scripting import Pig

# compile the Pig script
input = ":INPATH:/singlefile/studenttab10k"
P = Pig.compile("""A = load '$in' as (name, age, gpa); store A into 'output';""")

# bind variables
Q = P.bind({'in': input})

# launch the Pig script
result = Q.runSingle()

if result.isSuccessful():
    print "Pig job PASSED"
else:
    raise "Pig job FAILED"



Pig Embedding
 • Running an embedded Pig script
    pig sample.py
 • What happens within Pig?

(Diagram: sample.py is handed to Pig, which passes the Python script to Jython; embedded Pig scripts are handed back to Pig for execution.)
Nested Operator
• Nested Operator: Operator inside foreach
  B = group A by name;
  C = foreach B {
    C0 = limit A 10;
    generate C0;
  }


• Prior to Pig 0.10, supported nested operators were
  – DISTINCT, FILTER, LIMIT, and ORDER BY
• New operators added in 0.10
  – CROSS, FOREACH



Nested Cross/Foreach
A = LOAD 'studenttab10k' as (name:chararray, age:int, gpa:double);
B = LOAD 'votertab10k' as (name:chararray, age:int, registration, contributions:double);
C = cogroup A by name, B by name;
D = foreach C {
   C1 = filter A by gpa > 4;
   C2 = filter B by contributions > 500;
   C3 = cross C1, C2;
   C4 = foreach C3 generate CONCAT(CONCAT((chararray)gpa, '_'), (chararray)contributions);
   generate flatten(C4);
}
store D into 'output';




Misc Loaders
• HBaseStorage
• CassandraStorage
• AvroStorage
• JsonLoader/JsonStorage
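
A quick sketch of how a couple of these loaders are used in a script; the table, file, and column names here are invented for illustration:

-- load two HBase columns from the 'users' table
raw = LOAD 'hbase://users'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:name info:age')
      AS (name:chararray, age:int);

-- read and write JSON records
events = LOAD 'events.json' USING JsonLoader('user:chararray, url:chararray');
STORE events INTO 'events_out' USING JsonStorage();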




New operators to come
• Will be available in Pig 0.11
   – RANK
       – A distributed RANK implementation for Pig

   – CUBE
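
For reference, the syntax these operators ended up with in Pig 0.11 looks roughly like the following (relations, fields, and paths are illustrative):

-- RANK: assign a rank to each record of a relation
page_views = LOAD 'page_views' AS (url:chararray, views:long);
ranked     = RANK page_views BY views DESC;

-- CUBE: aggregate over all combinations of the given dimensions
sales  = LOAD 'sales' AS (product:chararray, region:chararray, amount:double);
cubed  = CUBE sales BY CUBE(product, region);
totals = FOREACH cubed GENERATE FLATTEN(group), SUM(cube.amount) AS total;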




