Pig programming is more fun: New features in Pig



Daniel Dai (@daijy)
Thejas Nair (@thejasn)




© Hortonworks Inc. 2011                        Page 1
What is Apache Pig?
• Pig Latin, a high-level data processing language.
• An engine that executes Pig Latin locally or on a Hadoop cluster.




Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

                  Architecting the Future of Big Data
                                                                                               Page 2
                  © Hortonworks Inc. 2011
Pig-latin example
• Query : Get the list of web pages visited by users whose
  age is between 20 and 29 years.

USERS = load 'users' as (uid, age);

USERS_20s = filter USERS by age >= 20 and age <= 29;

PVs = load 'pages' as (url, uid, timestamp);

PVs_u20s = join USERS_20s by uid, PVs by uid;
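The semantics of this four-statement pipeline can be sketched in plain Python; the sample records below are made up for illustration:

```python
# Plain-Python sketch of the Pig pipeline above; sample data is made up.
users = [("u1", 24), ("u2", 35), ("u3", 28)]            # (uid, age)
pages = [("a.com", "u1", 100), ("b.com", "u2", 101),    # (url, uid, timestamp)
         ("c.com", "u3", 102)]

# filter USERS by age >= 20 and age <= 29
users_20s = [(uid, age) for uid, age in users if 20 <= age <= 29]

# join USERS_20s by uid, PVs by uid
pvs_u20s = [(uid, age, url, ts)
            for uid, age in users_20s
            for url, puid, ts in pages if puid == uid]

print(pvs_u20s)  # [('u1', 24, 'a.com', 100), ('u3', 28, 'c.com', 102)]
```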



Why Pig?
• Faster development
  – Fewer lines of code
  – Don't re-invent the wheel

• Flexible
  – Metadata is optional
  – Extensible
  – Procedural programming



         Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

Before pig 0.9
[Diagram: three separate scripts: p1.pig, p2.pig, p3.pig]




With pig macros
[Diagram: p1.pig, p2.pig, and p3.pig sharing common logic via macro1.pig and macro2.pig]




With pig macros
[Diagram: p1.pig refactored to reuse the macros rm_bots.pig and get_top.pig]




Pig macro example
• Page_views data: (url, timestamp, uname, …)
• Find
  1. top 5 users (uname) by page views
  2. top 10 most visited urls




Pig Macro example
Without macros:

  page_views = LOAD ..
  /* get top 5 users by page view */
  u_grp = GROUP .. by uname;
  u_count = FOREACH .. COUNT ..
  ord_u_count = ORDER u_count ..
  top_5_users = LIMIT ordered.. 5;
  DUMP top_5_users;

  /* get top 10 urls by page view */
  url_grp = GROUP .. by url;
  url_count = FOREACH .. COUNT ..
  ord_url_count = ORDER url_count..
  top_10_urls = LIMIT ord_url.. 10;
  DUMP top_10_urls;

With a macro:

  /* top x macro */
  DEFINE topCount (rel, col, topNum)
  RETURNS top_num_recs {
    grped = GROUP $rel by $col;
    cnt_grp = FOREACH ..COUNT($rel)..
    ord_cnt = ORDER .. by cnt;
    $top_num_recs = LIMIT.. $topNum;
  }

  page_views = LOAD ..
  /* get top 5 users by page view */
  top_5_users = topCount(page_views, uname, 5);
  …
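The slide elides the macro body with "..", but its group/count/order/limit shape has a direct plain-Python analogue. The helper name top_count and the sample records below are made up for illustration:

```python
from collections import Counter

def top_count(records, key_index, top_num):
    """Plain-Python analogue of the topCount macro above:
    GROUP by a column, COUNT per group, ORDER by count, LIMIT."""
    counts = Counter(rec[key_index] for rec in records)
    return counts.most_common(top_num)

# (url, timestamp, uname) page_views sample; made-up data
page_views = [("a.com", 1, "ann"), ("b.com", 2, "ann"),
              ("a.com", 3, "bob"), ("a.com", 4, "ann")]

print(top_count(page_views, 2, 1))   # top user by page views: [('ann', 3)]
print(top_count(page_views, 0, 1))   # top url by page views: [('a.com', 3)]
```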



Pig macro
• Coming soon – piggybank with pig macros




Writing data flow program
• Writing a complex data pipeline is an iterative process

[Diagram: two example pipelines: Load → Transform, and Load → Join → Group → Transform → Filter]
Writing data flow program


[Same diagram; this time the final Filter step produces no output]
Writing data flow program
• Debug!

[Same diagram, annotated with hypotheses: a bug in Transform? a join on the wrong attributes? a filter that dropped everything?]
Common approaches to debug
• Running on real (large) data
   – Inefficient; takes longer
• Running on (small) samples
   – Joins and selective filters often produce empty results




Pig illustrate command
• Objective: show input/output examples for each statement that are
  – Realistic
  – Complete
  – Concise
  – Generated fast
• Steps
  –Downstream – sample and process
  –Prune
  –Upstream – generate realistic missing classes of examples
  –Prune


Illustrate command demo




Pig relation-as-scalar
• In Pig, each statement alias is a relation
   –Relation is a set of records
• Task: Get list of pages whose load time was more
  than average.
• Steps
   1. Compute average load time
   2. Get list of pages whose load time is > average




Pig relation-as-scalar
• Step 1 looks like
 .. = load ..
 .. = group ..
 al_rel = foreach .. AVG(ltime) as avg_ltime;


• Step 2 looks like
   page_views = load 'pviews.txt' as
                   (url, ltime, ..);

   slow_views = filter page_views by
               ltime > avg_ltime;




Pig relation-as-scalar
• Getting the result of step 1 into step 2 previously required
   – Joining the result of step 1 with the page_views relation, or
   – Writing the result to a file, then using a UDF to read it back
• The Pig scalar feature now simplifies this:
   slow_views = filter page_views by
               ltime > al_rel.avg_ltime;


   –Runtime exception if al_rel has more than one record.
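The relation-as-scalar semantics, including the one-record restriction, can be sketched in plain Python (made-up sample data):

```python
# Plain-Python sketch of relation-as-scalar semantics; data is made up.
page_views = [("a.com", 120), ("b.com", 300), ("c.com", 90)]  # (url, ltime)

# Step 1: al_rel is a relation holding a single record, the average load time.
al_rel = [sum(lt for _, lt in page_views) / len(page_views)]

# Using al_rel as a scalar only works if it holds exactly one record;
# Pig raises a runtime exception otherwise.
if len(al_rel) != 1:
    raise RuntimeError("scalar relation has more than one record")
avg_ltime = al_rel[0]

# Step 2: filter page_views by ltime > avg_ltime
slow_views = [(url, lt) for url, lt in page_views if lt > avg_ltime]
print(slow_views)  # [('b.com', 300)]
```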




UDF in Scripting Language
• Benefits
   – Reuse legacy code
   – Use libraries in the scripting language
   – Leverage Hadoop for non-Java programmers
• Currently supported languages
   – Python (0.8)
   – JavaScript (0.8)
   – Ruby (0.10)
• Extensible interface
   – Minimal effort to support another language



Writing a Python UDF
Write a Python UDF (util.py):

  @outputSchema("word:chararray")
  def concat(word):
    return word + word

  @outputSchemaFunction("squareSchema")
  def square(num):
    if num == None:
        return None
    return ((num)*(num))

  def squareSchema(input):
    return input

Use it from Pig:

  register 'util.py' using jython as util;
  B = foreach A generate util.square(i);

• Invoke Python functions when needed
• Type conversion
  – Python simple type <-> Pig simple type
  – Python Array <-> Pig Bag
  – Python Dict <-> Pig Map
  – Python Tuple <-> Pig Tuple
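Under Jython, Pig supplies the @outputSchema and @outputSchemaFunction decorators; outside Pig you can stub them with no-ops to unit-test the UDFs as plain Python. The stubs below are a sketch, not Pig's actual decorator implementation:

```python
# No-op stand-ins for the decorators Pig injects when running under Jython;
# they just attach the schema metadata and return the function unchanged.
def outputSchema(schema):
    def wrap(f):
        f.outputSchema = schema
        return f
    return wrap

def outputSchemaFunction(name):
    def wrap(f):
        f.outputSchemaFunction = name
        return f
    return wrap

@outputSchema("word:chararray")
def concat(word):
    return word + word

@outputSchemaFunction("squareSchema")
def square(num):
    if num is None:          # Pig nulls arrive as None
        return None
    return num * num

def squareSchema(input):
    return input             # output type mirrors the input type

print(concat("pig"), square(4), square(None))  # pigpig 16 None
```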

Use NLTK in Pig
• Example

  register 'nltk_util.py' using jython as nltk;
  ……
  B = foreach A generate nltk.tokenize(sentence);

  nltk_util.py:

  import nltk
  porter = nltk.PorterStemmer()

  @outputSchema("words:{(word:chararray)}")
  def tokenize(sentence):
    tokens = nltk.word_tokenize(sentence)
    words = [porter.stem(t) for t in tokens]
    return words

[Diagram: "Pig eats everything" → Tokenize → Stemming → (Pig) (eat) (everything)]
Comparison with Pig Streaming

                   Pig Streaming                      Scripting UDF

  Syntax           B = stream A through `perl         B = foreach A generate
                   sample.pl`;                        myfunc.concat(a0, a1), a2;

  Input/Output     stdin/stdout;                      function parameters / return
                   entire relation                    value; particular fields

  Type Conversion  Need to parse input /              Type conversion is
                   convert types                      automatic

  Modularize       Every streaming operator           Organize the functions
                   needs a separate script            into a module




Writing a Script Engine
Writing a bridge UDF:

class JythonFunction extends EvalFunc<Object> {
   public Object exec(Tuple tuple) {
     // Convert Pig input into Python
     PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
     // Invoke the Python UDF
     PyObject result = f.__call__(params);
     // Convert the result back to Pig
     return JythonUtils.pythonToPig(result);
   }
   public Schema outputSchema(Schema input) {
     PyObject outputSchemaDef = f.__findattr__("outputSchema".intern());
     return Utils.getSchemaFromString(outputSchemaDef.toString());
   }
}




Writing a Script Engine
Register scripting UDF

register 'util.py' using jython as util;

What happens in Pig
class JythonScriptEngine extends ScriptEngine {
   public void registerFunctions(String path, String namespace, PigContext pigContext) {
     ……
   }
}

Each function in the registered script (e.g. myudf.py) is wrapped in a bridge UDF:

  def square(num): ……   ->  JythonFunction("square")
  def concat(word): ……  ->  JythonFunction("concat")
  def count(bag): ……    ->  JythonFunction("count")



Algebraic UDF in JRuby
class SUM < AlgebraicPigUdf
  output_schema Schema.long

  # Initial function
  def initial num
    num
  end

  # Intermediate function
  def intermed num
    num.flatten.inject(:+)
  end

  # Final function
  def final num
    intermed(num)
  end
end
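The initial/intermed/final contract lets Pig compute partial sums in mappers and combine them later. A plain-Python sketch of that contract (the two "map tasks" below are simulated, not real Hadoop tasks):

```python
# Plain-Python sketch of the Algebraic contract: initial runs per input
# value, intermed combines partial results, final combines the intermeds.
def initial(num):
    return num

def intermed(nums):
    return sum(nums)

def final(nums):
    return intermed(nums)

data = [1, 2, 3, 4, 5, 6]

# Simulate two map tasks, each producing a partial sum:
partials = [intermed([initial(n) for n in data[:3]]),
            intermed([initial(n) for n in data[3:]])]

print(final(partials))   # 21, same answer as sum(data)
```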


Pig Embedding
• Embed Pig inside a scripting language
  – Python
  – JavaScript
• For algorithms that cannot be expressed in a single Pig script
  – Iterative algorithms
       – PageRank, K-means, Neural Network, Apriori, etc.
  – Parallel independent execution
       – Ensembles
  – Divide and conquer
  – Branching




Pig Embedding
from org.apache.pig.scripting import Pig

input = ":INPATH:/singlefile/studenttab10k"

# Compile the Pig script
P = Pig.compile("""A = load '$in' as (name, age, gpa);
                   store A into 'output';""")

# Bind variables
Q = P.bind({'in':input})

# Launch the Pig script
stats = Q.runSingle()

# Iterate over the result
result = stats.result('A')
for t in result.iterator():
   print t

Convergence Example
P = Pig.compile("""DEFINE myudf MyUDF('$param');
                   A = load 'input';
                   B = foreach A generate MyUDF(*);
                   store B into 'output';""")

while True:
  # Bind to the new parameter
  Q = P.bind({'param':new_parameter})
  results = Q.runSingle()
  iter = results.result("result").iterator()
  # Convergence check
  if converged:
      break
  # Change the parameter
  new_parameter = xxxxxx
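The driver-loop shape above can be run end to end with a stand-in run() in place of P.bind(...)/Q.runSingle(); the computation (halve the parameter until it drops below a threshold) is made up purely to exercise the loop:

```python
# Iterate-until-convergence driver sketch; run() stands in for one
# Pig script execution with the current parameter bound in.
def run(param):
    return param / 2.0        # made-up computation for illustration

new_parameter = 100.0
while True:
    result = run(new_parameter)
    if result < 1.0:          # convergence check
        break
    new_parameter = result    # change parameter and iterate

print(result)                 # 0.78125
```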




Pig Embedding
• Running an embedded Pig script:
   pig sample.py
• What happens within Pig?

[Diagram: Pig hands sample.py to Jython; the Python driver loop (Q = P.bind(); results = Q.runSingle(); converged?) launches a Pig script on each iteration, until the loop ends]

Nested Operator
• Nested Operator: Operator inside foreach
  B = group A by name;
  C = foreach B {
    C0 = limit A 10;
    generate flatten(C0);
  }


• Prior to Pig 0.10, the supported nested operators were
  –DISTINCT, FILTER, LIMIT, and ORDER BY
• New operators added in 0.10
  –CROSS, FOREACH



Nested Cross/ForEach
A = { (i0, a), (i0, b) }                B = { (i0, 0), (i0, 1) }

CoGroup A, B:
  C = { (i0, {a, b}, {0, 1}) }

Cross A, B (inside the foreach):
  { (i0, { (a, 0), (a, 1), (b, 0), (b, 1) }) }

ForEach … CONCAT:
  { (i0, { (a0), (a1), (b0), (b1) }) }

C = CoGroup A, B;
D = ForEach C {
  X = Cross A, B;
  Y = ForEach X generate CONCAT(f1, f2);
  Generate Y;
}
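The nested CROSS/FOREACH walkthrough can be reproduced in plain Python on the same tiny relations, to check each intermediate value:

```python
# Plain-Python walkthrough of the nested CROSS/FOREACH example above.
A = [("i0", "a"), ("i0", "b")]
B = [("i0", 0), ("i0", 1)]

# C = COGROUP A, B: one record per key, carrying both bags
keys = sorted({k for k, _ in A} | {k for k, _ in B})
C = [(k,
      [v for kk, v in A if kk == k],
      [v for kk, v in B if kk == k]) for k in keys]

# Inside the FOREACH: X = CROSS A, B per group, then CONCAT(f1, f2)
D = [(k, [str(x) + str(y) for x in bag_a for y in bag_b])
     for k, bag_a, bag_b in C]

print(D)   # [('i0', ['a0', 'a1', 'b0', 'b1'])]
```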
HCatalog Integration
• HCatalog

[Diagram: Pig, MapReduce, and Hive all access data through HCatalog]

• HCatLoader/HCatStorage
  – Load from / store to HCatalog from Pig
• HCatalog DDL Integration (Pig 0.11)
  – sql "create table student(name string, age int, gpa double);"

Misc Loaders
• HBaseStorage
  –Pig builtin
• AvroStorage
  –Piggybank
• CassandraStorage
  –In Cassandra code base
• MongoStorage
  –In the MongoDB code base
• JsonLoader/JsonStorage
  –Pig builtin



Talend
Enterprise Data Integration
• Talend Open Studio for Big Data
   – Feature-rich Job Designer
   – Rich palette of pre-built templates
   – Supports HDFS, Pig, Hive, HBase, HCatalog
   – Apache-licensed, bundled with HDP


• Key benefits
   – Graphical development
   – Robust and scalable execution
   – Broadest connectivity to support
     all systems:
     450+ components
   – Real-time debugging




Questions




