Pig Experience


  • Gives better control in data processing
  • More natural to programmers than flat tuples; avoids expensive joins
  • September 11, 2008: Initial Apache open-source release. November 11, 2008: Enhanced type system, rewrote execution pipeline, enhanced use of combiner. January 20, 2009: Rework of buffering during data parsing, fragment-replicate join algorithm. February 23, 2009: Rework of partitioning function used in ORDER BY to ensure more balanced distribution of keys to reducers. April 20, 2009: Branching execution plans.
  • Pig Experience

    1. 1. Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience. Tilani Gunawardena
    2. 2. Content • Introduction • Background • System Overview • Type System & Type Inference • Compilation to Map-Reduce • Plan Execution • Streaming • Performance • Adoption • Project Experience • Future Work
    3. 3. Introduction • Internet companies are swimming in data: TBs/day for Yahoo! or Google, PBs/day for Facebook • Data: unstructured elements (web page text, images) and structured elements (web page click records, extracted entity-relationship models) • Processing: filter, join, count • Data warehousing? – Scale: often not scalable enough – Price: prohibitively expensive at web scale – SQL: high-level declarative approach, little control over execution method • The Map-Reduce appeal? – Scale: scalable due to simpler design and an explicit programming model – Price: runs on cheap commodity hardware – SQL
    4. 4. MapReduce Disadvantages • Does not directly support complex N-step dataflows • Lacks explicit support for combined processing of multiple data sets (joins and other data-matching operations) • Frequently needed data manipulation primitives must be coded by hand: filtering, aggregation, join, projection, sorting
    5. 5. Pig • Pig's language, Pig Latin, chooses a spot between the Map-Reduce framework and SQL • Defines a new language to allow better control in large-scale data processing • Frees database programmers from writing map and reduce code, which is at too low a level
    6. 6. Pig Latin: Data Types • Rich and simple data model • Simple types: int, long, double, chararray (string), bytearray • Complex types: – Map: an associative array; key: chararray; value: any type – Tuple: collection of fields, e.g. ('apple', 'mango') – Bag: collection of tuples, e.g. { ('apple', 'mango'), ('apple', ('red', 'yellow')) }
    7. 7. Pig Latin: Input/Output Data • Input: queries = LOAD 'data.txt' USING BinStorage AS (url, category, pagerank); • Output: STORE result INTO 'myoutput'; • BinStorage: binary storage function in Pig
    8. 8. Pig Latin: Expression Table
    9. 9. Pig Latin: General Syntax • Discarding unwanted data: FILTER • Comparison operators such as ==, eq, !=, neq • Logical connectors AND, OR, NOT
    10. 10. Pig Latin: Type Declaration • Pig supports three options for declaring the data types of fields: – No data types are declared: the default is to treat all fields as bytearray. Ex: a = LOAD 'data' USING BinStorage AS (user); – Types are provided explicitly as part of the AS clause during the LOAD. Ex: a = LOAD 'data' USING BinStorage AS (user:chararray); – The load function itself provides the schema information, which accommodates self-describing data formats such as JSON
    11. 11. Pig Latin: Lazy Conversion of Types • When Pig needs to cast a bytearray to another type because the program applies a type-specific operator, it delays that cast to the point where it is actually necessary • In the example: status will need to be cast to a chararray • earnedPoints and possiblePoints will need to be cast to double • These casts will not be done when the data is loaded • They will be done as part of the comparison and division operations • This avoids casting values that are removed by the filter before the result of the cast is used
    12. 12. Pig Latin: Operators • LOAD: LOAD 'data' [USING function] [AS schema]; where 'data' is the name of a file or directory; USING and AS are keywords; function is the load function; schema is the schema of the data the loader produces (if the data does not conform to the schema, an error is generated). Ex: LOAD 'clicks' AS (userid, pageid, linkid, viewedat); LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp); • STORE: stores results to the file system: STORE alias INTO 'directory' [USING function]; where alias is the name of a relation; INTO and USING are keywords; 'directory' is the name of the storage directory (if the directory already exists, the operation fails); function is the store function. Ex: STORE result INTO 'myOutput'; STORE query_revenues INTO 'myoutput' USING myStore();
    13. 13. FOREACH • Generates data transformations based on columns of data • E.g.: X = FOREACH A GENERATE a1, a2; • Without flattening: expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString); • With flattening: expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));
    14. 14. GROUP / COGROUP • Groups the data in one or more relations • GROUP is used for 1 relation • COGROUP is used for 1 to 127 relations
    15. 15. JOIN (inner) • Performs an inner join of two or more relations based on common field values • E.g.: if A contains {(1,2,3), (4,2,1)} and B contains {(1,3), (4,6), (4,9)}, then X = JOIN A BY a1, B BY b1; yields (1,2,3,1,3), (4,2,1,4,6), (4,2,1,4,9) ORDER BY • Sorts a relation based on one or more fields • E.g.: X = ORDER A BY a3 DESC; yields (1,2,3), (4,2,1)
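The lazy-cast idea on this slide can be sketched in a few lines of Python. This is only an illustration, not Pig's implementation: the record layout (name, status, possiblePoints, earnedPoints) and the `'paid'` filter value are assumptions inferred from the field names mentioned on the slide, and `lazy_gpa` is a hypothetical helper.

```python
def lazy_gpa(rows):
    """rows: tuples of raw bytes (name, status, possible_points, earned_points).

    Returns the computed ratios plus the number of numeric casts performed.
    """
    numeric_casts = 0
    out = []
    for name, status, possible, earned in rows:
        # status is cast to a string only because the comparison needs it.
        if status.decode() != "paid":  # assumed filter condition
            continue  # filtered out: the numeric fields are never cast
        # The numeric casts happen only at the division that needs them.
        ratio = float(earned.decode()) / float(possible.decode())
        numeric_casts += 2
        out.append((name.decode(), ratio))
    return out, numeric_casts
```

Because the second field of a filtered-out row is never converted, a malformed numeric value in such a row costs nothing and raises no error.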
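To make the difference between the two FOREACH statements concrete, here is a small Python sketch of the semantics. `expand_query` stands in for the slide's `expandQuery` UDF; its behaviour here (returning a bag of two rewritten queries) is invented purely for illustration.

```python
def expand_query(query_string):
    # Hypothetical UDF: returns a bag (list) of one-field tuples.
    return [(query_string,), (query_string + " online",)]

def foreach_generate(queries, flatten):
    """Mimic FOREACH queries GENERATE userId, [FLATTEN](expandQuery(queryString))."""
    out = []
    for user_id, query_string in queries:
        bag = expand_query(query_string)
        if flatten:
            # FLATTEN: one output tuple per tuple in the bag.
            out.extend((user_id,) + t for t in bag)
        else:
            # No FLATTEN: the bag stays nested inside a single output tuple.
            out.append((user_id, bag))
    return out
```

Without flattening, each input tuple yields one output tuple holding a nested bag; with flattening, the bag is unnested into separate flat tuples.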
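The shape of a GROUP or COGROUP result can be sketched in Python as follows. This is a semantic illustration only; the function name `cogroup` and the use of positional key indexes are choices made for this sketch, not Pig's API.

```python
from collections import defaultdict

def cogroup(relations, key_indexes):
    """Return one tuple per key: (key, bag_from_relation_0, bag_from_relation_1, ...)."""
    groups = defaultdict(lambda: [[] for _ in relations])
    for i, (relation, k) in enumerate(zip(relations, key_indexes)):
        for t in relation:
            groups[t[k]][i].append(t)
    return [(key, *bags) for key, bags in sorted(groups.items())]
```

Calling it with a single relation behaves like GROUP; with several relations, each key is paired with one bag per input relation.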
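The JOIN and ORDER BY examples on this slide can be reproduced with a short Python sketch. The helper `inner_join` is hypothetical and uses an in-memory hash index, which is just one simple way to compute an inner join; fields are addressed by position (a1, b1 are index 0; a3 is index 2).

```python
def inner_join(a, b, a_key, b_key):
    """Inner join of two lists of tuples on the given field positions."""
    index = {}
    for t in b:
        index.setdefault(t[b_key], []).append(t)
    return [ta + tb for ta in a for tb in index.get(ta[a_key], [])]

A = [(1, 2, 3), (4, 2, 1)]
B = [(1, 3), (4, 6), (4, 9)]

X = inner_join(A, B, 0, 0)                               # JOIN A BY a1, B BY b1
ordered = sorted(A, key=lambda t: t[2], reverse=True)    # ORDER A BY a3 DESC
```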
    16. 16. System Overview
    17. 17. • A step-by-step dataflow language where computation steps are chained together through the use of variables • The use of high-level transformations, e.g., GROUP, FILTER • The ability to specify schemas as part of issuing a program • The use of user-defined functions (e.g., top10)
    18. 18. Pig allows three modes of user interaction: • Interactive mode: the user is presented with an interactive shell (called Grunt), which accepts Pig commands • Batch mode: a user submits a prewritten script containing a series of Pig commands • Embedded mode: Pig is also provided as a Java library, allowing Pig Latin commands to be submitted via method invocations from a Java program
    19. 19. Pig System Process • Parser • Logical Optimizer • Map-Reduce Compiler – Logical to Physical Compilation – Physical to Map-Reduce Compilation – Branching Plans • Map-Reduce Optimizer • Hadoop Job Manager
    20. 20. Parser • Verifies that the program is syntactically correct and that all referenced variables are defined • Type checking • Schema inference • Verifies the ability to instantiate classes corresponding to UDFs • Confirms the existence of streaming executables – Output of parser: logical plan • One-to-one correspondence between Pig Latin statements and logical operators • Arranged in a directed acyclic graph (DAG) Logical Optimizer • Logical optimizations such as projection pushdown are carried out
    22. 22. Map-Reduce Compiler: Logical to Physical Compilation (1) • Logical plan => physical plan => Map-Reduce plan
    23. 23. Map-Reduce Compiler: Logical to Physical Compilation (2) • The Map-Reduce compiler compiles the logical plan into a series of Map-Reduce jobs • The (CO)GROUP operator becomes a series of three physical operators: – Local and global rearrange operators: bring tuples with the same key onto the same machine and adjacent in the data stream; rearranging is done by hashing or sorting by key – Package operator: places adjacent same-key tuples into a single-tuple package • The JOIN operator is handled in one of two ways: – Rewritten into COGROUP followed by a FOREACH operator that performs "flattening", giving a parallel hash join or sort-merge join – Fragment-replicate join, which executes entirely in the map stage or entirely in the reduce stage
    24. 24. Map-Reduce Compiler: Logical to Physical Compilation (3) • Example of (CO)GROUP conversion: (1,R), (2,G) in stream A; (1,B), (2,Y) in stream B • Local rearrange operator: annotates each tuple with its key, e.g. converts tuple (1,R) to {1,(1,R)} • Global rearrange operator: sorts by key, e.g. Reducer 1: {1,{(1,R),(1,B)}}, Reducer 2: {2,{(2,G),(2,Y)}} • Package operator: places same-key tuples into a single-tuple package, e.g. Reducer 1: {1,{(1,R)},{(1,B)}}, Reducer 2: {2,{(2,G)},{(2,Y)}}
    25. 25. Map-Reduce Compiler: Logical to Physical Compilation (4) • Three types of join operators: – Fragment-replicate join: joins a huge table with a very small table; the huge table is fragmented and distributed to the mappers (or reducers); the small table is replicated to each machine; runs either in the map stage or in the reduce stage – Parallel hash join: the map stage hashes the tables by join key; the reduce stage joins the fragments of the tables; data with the same hash value is assigned to one reducer – Sort-merge join: both inputs are sorted on the join key; each node gets a fragment of each sorted table, and rows with the same key go to the same node; each node performs the join, so a map step alone is sufficient
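The three physical operators in this example can be sketched as three small Python functions. This is a single-process illustration under stated assumptions: the stream identifier attached by `local_rearrange` and the use of a plain sort for `global_rearrange` are simplifications, and none of the function names come from Pig itself.

```python
from itertools import groupby

def local_rearrange(stream_id, tuples, key_index=0):
    # Annotate every tuple with its key and the stream it came from.
    return [(t[key_index], stream_id, t) for t in tuples]

def global_rearrange(annotated):
    # Bring same-key records next to each other (here: a plain sort by key).
    return sorted(annotated, key=lambda r: (r[0], r[1]))

def package(sorted_records, n_streams):
    # Collect adjacent same-key records into one tuple with one bag per stream.
    out = []
    for key, records in groupby(sorted_records, key=lambda r: r[0]):
        bags = [[] for _ in range(n_streams)]
        for _, stream_id, t in records:
            bags[stream_id].append(t)
        out.append((key, *bags))
    return out
```

Chaining the three functions over streams A and B from the slide produces one packaged tuple per key, with the A-tuples and B-tuples kept in separate bags.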
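As a rough illustration of the fragment-replicate idea, the Python sketch below joins each fragment of the large table against an in-memory copy of the small table, with no shuffle step. The names and the single-process setup are assumptions for the sketch; in a real cluster each fragment would be handled by a separate map task holding its own replica of the small table.

```python
def fragment_replicate_join(big_fragments, small_table, big_key, small_key):
    """Join each fragment of the big table against a replicated small table."""
    # The small table is turned into a hash lookup that every task can hold in memory.
    lookup = {}
    for t in small_table:
        lookup.setdefault(t[small_key], []).append(t)

    def map_task(fragment):
        # Runs independently per fragment; needs no data from other fragments.
        return [tb + ts for tb in fragment for ts in lookup.get(tb[big_key], [])]

    return [row for fragment in big_fragments for row in map_task(fragment)]
```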
    27. 27. Map-Reduce Compiler: Physical to Map-Reduce Compilation (1) • Physical operators are assigned to Hadoop stages so as to minimize the number of reduce stages • The local rearrange operator simply annotates tuples with keys and stream identifiers, and lets the Hadoop local sort stage do the work • Global rearrange operators are removed; they are implemented by the Hadoop shuffle and merge stages • Load and store operators are removed; the Hadoop framework reads and writes the data
    29. 29. Map-Reduce Compiler: Branching Plans (1) • More than one STORE command: one for each branch of the split • Data is read once and processed in multiple ways • Risk of data spilling to disk • SPLIT operator: feeds a copy of its input to each nested sub-plan • Example 1: a logical SPLIT command splits the table; map-only plan: clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat); SPLIT clicks INTO pages IF pageid IS NOT NULL /* corresponds to the FILTER of the 1st sub-plan */, links IF linkid IS NOT NULL /* corresponds to the FILTER of the 2nd sub-plan */; /* 1st sub-plan */ cpages = FOREACH pages GENERATE userid, CanonicalizePage(pageid) AS page, viewedat; STORE cpages INTO 'pages'; /* 2nd sub-plan */ clinks = FOREACH links GENERATE userid, CanonicalizeLink(linkid) AS clink, viewedat; STORE clinks INTO 'links';
    30. 30. Map-Reduce Compiler: Branching Plans (2) • Example 2: the split propagates across the map/reduce boundary • No logical SPLIT operator • The compiler inserts a physical SPLIT operator • MULTIPLEX operator: routes tuples to the correct sub-plan; used in the reduce stage only: clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat); goodclicks = FILTER clicks BY viewedat IS NOT NULL; /* 1st sub-plan: grouped by pageid */ bypage = GROUP goodclicks BY pageid; cntbypage = FOREACH bypage GENERATE group, COUNT(goodclicks); STORE cntbypage INTO 'bypage'; /* 2nd sub-plan: grouped by linkid */ bylink = GROUP goodclicks BY linkid; cntbylink = FOREACH bylink GENERATE group, COUNT(goodclicks); STORE cntbylink INTO 'bylink';
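The "read once, process in multiple ways" behaviour of SPLIT can be sketched in Python as a single pass that feeds every matching branch. The function name `split` and the dictionary of predicates are illustrative choices, not Pig internals; note that, as with the slide's two conditions, a tuple may satisfy more than one branch.

```python
def split(rows, conditions):
    """Route each input row to every branch whose condition it satisfies.

    conditions: mapping of branch name -> predicate over a row.
    """
    outputs = {name: [] for name in conditions}
    for row in rows:  # the input is scanned exactly once
        for name, predicate in conditions.items():
            if predicate(row):
                outputs[name].append(row)
    return outputs
```

For the clicks example, the two predicates would test `pageid` (position 1) and `linkid` (position 2) for non-null values.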
    32. 32. Map-Reduce Optimizer • Performs early partial aggregation for distributive or algebraic aggregation functions • E.g., for the function AVERAGE, the steps are: a) Initial: generate (sum, count) pairs; assigned to the map stage b) Intermediate: combine n (sum, count) pairs into a single pair; assigned to the combine stage c) Final: combine n (sum, count) pairs and take the quotient; assigned to the reduce stage
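The three-step decomposition of AVERAGE described above can be written out directly. The Python below is a minimal sketch; the function names simply mirror the slide's step names and are not Pig's actual interface.

```python
def initial(values):
    # Map stage: turn raw values into a (sum, count) pair.
    return (sum(values), len(values))

def intermediate(pairs):
    # Combine stage: merge n (sum, count) pairs into a single pair.
    return (sum(s for s, _ in pairs), sum(c for _, c in pairs))

def final(pairs):
    # Reduce stage: merge the remaining pairs and take the quotient.
    total, count = intermediate(pairs)
    return total / count
```

Because the intermediate step maps pairs to a pair, it can be applied any number of times (including zero) without changing the final answer, which is what makes the function safe to push into a combiner.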
    34. 34. Hadoop Job Manager • Map-Reduce jobs are sorted and submitted to Hadoop for execution • A Java jar file is generated for the Map and Reduce implementation classes and the UDFs • The Map and Reduce classes contain general-purpose dataflow execution engines • Monitors execution and generates periodic reports • Warnings or errors are logged and reported
    35. 35. Plan Execution • Flow Control – Nested Programs • Memory Management Streaming • Flow Control
    36. 36. PLAN EXECUTION - FLOW CONTROL • Execution of a Map or Reduce stage of the physical plan by Pig • Assume that data flows downward in an execution plan • Two models are available to control the movement of tuples through the execution pipeline: push and pull (iterator) • 1) Push model: e.g., operator A pushes data to B, which operates on it and pushes the result to C (A, B and C are physical operators). Difficult to implement for UDFs with multiple inputs and for binary operators like fragment-replicate join • 2) Pull model: e.g., operator C asks B for its next data item. If B has nothing pending to return, it asks A. When A returns a data item, B operates on it and returns the result to C • Advantages: a single-threaded implementation avoids context-switching overhead; simple APIs for UDFs • Drawbacks: operations over a bag nested inside a tuple may lead to memory overflow; if the data flow graph has multiple sinks, operators at branch points may be required to buffer an unbounded number of tuples
    37. 37. PLAN EXECUTION - FLOW CONTROL (2) • Solution: when asked to produce a tuple, an operator can respond in one of three ways: a) return a tuple; b) declare itself finished; or c) return a pause signal to indicate that it is not finished but is not currently able to produce an output tuple
    38. 38. PLAN EXECUTION - FLOW CONTROL (3) NESTED PROGRAMS: • Pig operators can be invoked over bags nested within tuples • For example, to compute the number of distinct pages and links visited by each user: clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat); yields (Alice,Page1,Link1,Site1) (John,Page1,Link2,Site2) (John,Page2,Link2,Site3) • byuser = GROUP clicks BY userid; yields (Alice, {(Alice,Page1,Link1,Site1)}) (John, {(John,Page1,Link2,Site2), (John,Page2,Link2,Site3)}) • result = FOREACH byuser { uniqPages = DISTINCT clicks.pageid; uniqLinks = DISTINCT clicks.linkid; GENERATE group, COUNT(uniqPages), COUNT(uniqLinks); }; yields (Alice, 1, 1) (John, 2, 1)
    39. 39. PLAN EXECUTION - FLOW CONTROL (4) • The outer operator graph contains a FOREACH operator • It contains a nested operator graph of two pipelines • Each pipeline contains DISTINCT and COUNT operators • FOREACH requests a tuple T from the PACKAGE operator • It places a cursor on the bag of click tuples for the first DISTINCT-COUNT pipeline • It requests a tuple from the bottom of the pipeline (the COUNT operator) • The process is repeated for the second pipeline • The FOREACH operator constructs and returns the output tuple
    40. 40. PLAN EXECUTION - FLOW CONTROL • A more complex situation arises when the nested plan is not two independent pipelines but rather a single branching pipeline: clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat); yields (Alice,Page1,Link1,Site1) (John,Page1,Link2,Site2) (John,Page2,Link2,NULL) • byuser = GROUP clicks BY userid; yields (Alice, {(Alice,Page1,Link1,Site1)}) (John, {(John,Page1,Link2,Site2), (John,Page2,Link2,NULL)}) • result = FOREACH byuser { fltrd = FILTER clicks BY viewedat IS NOT NULL; uniqPages = DISTINCT fltrd.pageid; uniqLinks = DISTINCT fltrd.linkid; GENERATE group, COUNT(uniqPages), COUNT(uniqLinks); }; yields (Alice, 1, 1) (John, 1, 1) • Solution: Pig currently handles this case by duplicating the FILTER operator and producing two independent pipelines, to be executed as explained above
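The three possible responses can be sketched as a tiny pull-based pipeline in Python. Everything here is illustrative: the `Status` enum, the `Source` and `Filter` classes, and the convention that `None` in the source models "nothing ready yet" are assumptions made for the sketch, not Pig's operator API.

```python
from enum import Enum

class Status(Enum):
    TUPLE = 1      # a) a tuple is returned
    FINISHED = 2   # b) the operator has no more output
    PAUSE = 3      # c) not finished, but nothing to return right now

class Source:
    def __init__(self, items):
        self._items = list(items)

    def next(self):
        if not self._items:
            return Status.FINISHED, None
        item = self._items.pop(0)
        if item is None:  # models an input that is not ready yet
            return Status.PAUSE, None
        return Status.TUPLE, item

class Filter:
    def __init__(self, parent, predicate):
        self._parent, self._predicate = parent, predicate

    def next(self):
        status, t = self._parent.next()
        if status is Status.TUPLE and not self._predicate(t):
            # Input was consumed but produced no output: hand control back.
            return Status.PAUSE, None
        return status, t

def drain(op):
    """Pull from op until it finishes; return the tuples and the number of pauses seen."""
    out, pauses = [], 0
    while True:
        status, t = op.next()
        if status is Status.FINISHED:
            return out, pauses
        if status is Status.PAUSE:
            pauses += 1
            continue
        out.append(t)
```

The pause signal lets a single-threaded driver regain control instead of blocking inside an operator that cannot yet produce output.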
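The nested program walked through above can be mimicked in Python, with each nested pipeline making its own pass over the grouped bag. The helper names are hypothetical and fields are addressed by position (userid 0, pageid 1, linkid 2).

```python
from collections import defaultdict

def distinct_count(bag, field):
    # One nested pipeline: DISTINCT over a projected field, then COUNT.
    return len({t[field] for t in bag})

def per_user_counts(clicks):
    by_user = defaultdict(list)  # GROUP clicks BY userid
    for t in clicks:
        by_user[t[0]].append(t)
    # FOREACH group: run both pipelines over the same bag, then build the output tuple.
    return [(user, distinct_count(bag, 1), distinct_count(bag, 2))
            for user, bag in sorted(by_user.items())]
```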
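The "duplicate the FILTER" workaround can be illustrated with a short Python sketch in which each DISTINCT-COUNT pipeline applies its own copy of the filter rather than sharing one filtered intermediate bag. The function names are invented for the sketch, and `viewedat` is assumed to be the fourth field (position 3).

```python
def filtered_distinct_count(bag, field, keep):
    # An independent pipeline: its own FILTER, then DISTINCT, then COUNT.
    return len({t[field] for t in bag if keep(t)})

def counts_for_group(group_key, bag):
    def viewed(t):
        return t[3] is not None  # viewedat IS NOT NULL
    # The same predicate is evaluated once per pipeline instead of being shared.
    return (group_key,
            filtered_distinct_count(bag, 1, viewed),
            filtered_distinct_count(bag, 2, viewed))
```

The trade-off is that the filter runs twice per bag, in exchange for keeping the two pipelines independent.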
    42. 42. PLAN EXECUTION - Memory Management • Like Hadoop, Pig is implemented in Java • Java memory management causes problems during query processing: Java does not allow the developer to control memory allocation and deallocation directly • Naive option: increase the JVM memory size limit beyond the physical memory size, and let the virtual memory manager take care of staging data between memory and disk – Problem: performance degradation • Better to return an "out-of-memory" error, so that the administrator can adjust the memory management parameters and re-submit the program
    43. 43. PLAN EXECUTION - Memory Management • Memory overflow is mostly due to large bags of tuples • Java's MemoryPoolMXBean notifies Pig of low-memory situations; when notified, Pig spills excess bags to disk • Pig estimates bag sizes by sampling a few tuples • The memory manager maintains a list of the Pig bags created in the same JVM, using a linked list of Java WeakReferences • WeakReferences ensure that bags no longer in use can still be garbage collected
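The weak-reference bookkeeping described above has a close analogue in Python's `weakref` module, which the sketch below uses in place of Java's WeakReference. The `Bag` and `MemoryManager` classes are hypothetical, the "spill" only drops the tuples instead of writing them to disk, and spilling the largest bag first is an assumption of this sketch.

```python
import weakref

class Bag:
    def __init__(self, tuples):
        self.tuples = list(tuples)
        self.spilled = False

    def spill(self):
        # A real implementation would write the tuples to disk first.
        self.tuples = []
        self.spilled = True

class MemoryManager:
    def __init__(self):
        self._refs = []  # weak references only, so bags can still be collected

    def register(self, bag):
        self._refs.append(weakref.ref(bag))

    def live_bags(self):
        # Drop references whose bags have been garbage collected.
        self._refs = [r for r in self._refs if r() is not None]
        return [b for b in (r() for r in self._refs) if b is not None]

    def on_low_memory(self):
        # Assumed policy: spill the largest live bag.
        bags = sorted(self.live_bags(), key=lambda b: len(b.tuples), reverse=True)
        if bags:
            bags[0].spill()
```

Because the manager never holds a strong reference, registering a bag does not keep it alive once the query no longer uses it.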
    45. 45. STREAMING - FLOW CONTROL • Pig allows user-defined functions (UDFs) – UDFs must be written in Java and must conform to Pig's UDF interface – They behave synchronously • Streaming: allows data to be pushed through external executables – Users are able to intermix relational operations like grouping and filtering with custom or legacy executables • A streaming executable behaves asynchronously • Challenge in implementing streaming in Pig: fitting it into the iterator model of Pig's execution pipeline – Because of the asynchronous behavior of the user's executable, the STREAM operator that wraps the executable cannot simply pull tuples synchronously as it does with other operators, because it does not know what state the executable is in • There may be no output because: – the executable is waiting to receive more input: the STREAM operator needs to push new data – the executable is still busy processing prior inputs: the STREAM operator should wait
    46. 46. • With a single-threaded operator execution model, a deadlock can occur: the Pig operator is waiting for the external executable to consume a new input tuple, while at the same time the executable is waiting for its output to be consumed • Solution: the STREAM operator: – Creates two additional threads, one to feed data to the executable and the other to consume data from it – Blocks until a tuple is available on the executable's output queue or until the executable terminates – If space is available in the input queue, places a tuple from the parent operator into it
    47. 47. Performance • In the initial implementation of Pig, functionality and proof of concept were considered more important than performance • As Pig was adopted within Yahoo!, better performance quickly became a priority • Pig Mix: a publicly available benchmark used to measure performance on a regular basis, so that the effects of individual code changes on performance could be understood
    48. 48. Benchmark Results: Pig Mix benchmark • September 11, 2008: initial Apache open-source release • November 11, 2008: enhanced type system, rewrote execution pipeline, enhanced use of combiner • January 20, 2009: rework of buffering during data parsing, fragment-replicate join algorithm • February 23, 2009: rework of partitioning function used in ORDER BY to ensure more balanced distribution of keys to reducers • April 20, 2009: branching execution plans • Vertical axis: ratio of the total running time for 12 Pig programs to that of the corresponding Map-Reduce programs • The current performance ratio is 1.5, a reasonable trade-off point between execution time and code development/maintenance effort
    49. 49. Pros & Cons • The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single-block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data • With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially
    50. 50. Pros & Cons • Pros: explicit dataflow; retains properties of Map-Reduce; scalability; fault tolerance; multi-way processing; open source • Cons: column-wise storage structures are missing; memory management; no facilitation for non-Java users; limited optimization; no GUI for flow graphs
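The two-thread arrangement can be sketched in Python with `subprocess`, `threading` and `queue`. This is a simplified illustration, not Pig's STREAM operator: tuples are serialized as tab-separated lines (an assumption), the feeder iterates over the input directly instead of using a bounded input queue, and `stream_through` is a hypothetical name.

```python
import queue
import subprocess
import sys
import threading

def stream_through(command, tuples):
    """Push tuples through an external executable without deadlocking on its pipes."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    output = queue.Queue()
    done = object()  # sentinel: the executable has closed its output

    def feed():
        # Thread 1: feed data to the executable.
        for t in tuples:
            proc.stdin.write("\t".join(map(str, t)) + "\n")
        proc.stdin.close()

    def consume():
        # Thread 2: consume the executable's output into a queue.
        for line in proc.stdout:
            output.put(tuple(line.rstrip("\n").split("\t")))
        output.put(done)

    threading.Thread(target=feed, daemon=True).start()
    threading.Thread(target=consume, daemon=True).start()

    results = []
    while True:
        item = output.get()  # blocks until a tuple is available or the executable ends
        if item is done:
            break
        results.append(item)
    proc.wait()
    return results
```

A usage example, with a child Python process standing in for a legacy executable that upper-cases its input: `stream_through([sys.executable, "-c", "import sys\nfor l in sys.stdin: sys.stdout.write(l.upper())"], [("a", "b")])`. Feeding and consuming on separate threads means neither side can stall the other when a pipe buffer fills up.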
    51. 51. Future Work • Query optimization – Currently a rule-based optimizer for plan rearrangement and join selection – Cost-based in the future • Non-Java UDFs • SQL interface • Grouping and joining of pre-partitioned/sorted data – Avoid data shuffling for grouping and joining – Build metadata facilities to keep track of data layout • Skew handling – For load balancing
    52. 52. Summary • Big demand for parallel data processing – Programmers like dataflow pipes over static files • Ease of programming • UDFs: users can create their own functions to do special-purpose processing • Optimization opportunities: the way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency • Open source • Pig Latin: sweet spot between Map-Reduce and SQL
    53. 53. Related Work • Sawzall – Data processing language on top of Map-Reduce – Rigid structure of filtering followed by aggregation • Hive – SQL-like language on top of Map-Reduce • DryadLINQ – SQL-like language on top of Dryad
    54. 54. Thank You!