Pig Experience

Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience

Tilani Gunawardena

Content
• Introduction
• Background
• System Overview
• Type System & Type Inference
• Compilation To Map-Reduce
• Plan Execution
• Streaming
• Performance
• Adoption
• Project Experience
• Future Work

Introduction
• Internet companies are swimming in data
• TBs/day for Yahoo! or Google
• PBs/day for Facebook

• Data
– unstructured elements
• web page text, images
– structured elements
• web page click records, extracted entity-relationship models

• Processing
– Filter, join, count

• Data Warehousing?
– Scale: often not scalable enough
– Price: prohibitively expensive at web scale
– SQL:
• High-level declarative approach
• Little control over execution method

• The Map-Reduce Appeal?
– Scale: scalable due to simpler design, explicit programming model
– Price: runs on cheap commodity hardware
– SQL

MapReduce Disadvantages
• Does not directly support complex N-step dataflows
• Lacks explicit support for combined processing of multiple data sets
– joins and other data-matching operations
• Frequently needed data manipulation primitives must be coded by hand
– filtering, aggregation, join, projection, sorting

Pig
• Pig's language, Pig Latin, occupies a spot between the MapReduce framework and SQL
• Defines a new language to allow better control in large-scale data processing
• Frees database programmers from writing map and reduce code, which is at too low a level

Pig Latin: Data Types
• Rich and Simple Data Model

Simple Types:
int, long, double, chararray (string), bytearray

Complex Types:
• Map: an associative array; key: chararray; value: any type
• Tuple: collection of fields, e.g. ('apple', 'mango')
• Bag: collection of tuples
{ ('apple', 'mango')
('apple', ('red', 'yellow'))
}

Pig Latin: Input/Output Data
Input:
queries = LOAD 'data.txt'
USING BinStorage
AS (url, category, pagerank);
Output:
STORE result INTO 'myoutput';

BinStorage: binary storage function in Pig

Pig Latin: General Syntax
• Discarding unwanted data: FILTER
• Comparison operators such as ==, eq, !=, neq
• Logical connectors AND, OR, NOT

Pig Latin: Type Declaration
• Pig supports three options for declaring the data types of fields:
– No data types are declared: the default is to treat all fields as bytearray.
Ex: a = LOAD 'data' USING BinStorage AS (user);

– Types are declared explicitly as part of the AS clause during the LOAD:
Ex: a = LOAD 'data' USING BinStorage AS (user:chararray);

– The load function itself provides the schema information, which accommodates self-describing data formats such as JSON.

Pig Latin: Lazy Conversion of Types
• When Pig needs to cast a bytearray to another type because the program applies a type-specific operator, it delays that cast to the point where it is actually necessary.

In an example that compares status in a filter and then divides earnedPoints by possiblePoints:
• status will need to be cast to a chararray
• earnedPoints and possiblePoints will need to be cast to double
• These casts will not be done when the data is loaded
• They will be done as part of the comparison and division operations
• This avoids casting values that are removed by the filter before the result of the cast is used.
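
The lazy-cast idea above can be sketched in Python. This is only an illustration, not Pig's implementation: the sample rows, the 'paid' filter value, and the cast counter are assumptions added here to make the effect visible.

```python
# Sketch of lazy type conversion: fields stay raw bytes until an operator needs a type.
# The rows, the 'paid' filter value and the counter are illustrative assumptions.

cast_count = 0

def cast(raw, to_type):
    """Convert a raw bytes field on demand and count how often that happens."""
    global cast_count
    cast_count += 1
    return to_type(raw.decode("utf-8"))

# Loaded rows: every field is still an untyped bytes value.
rows = [
    {"name": b"ann", "status": b"paid", "earnedPoints": b"45", "possiblePoints": b"50"},
    {"name": b"bob", "status": b"unpaid", "earnedPoints": b"30", "possiblePoints": b"60"},
    {"name": b"cy", "status": b"paid", "earnedPoints": b"20", "possiblePoints": b"40"},
]

def lazy_pipeline(rows):
    for row in rows:
        # FILTER: only status is cast here, as part of the comparison.
        if cast(row["status"], str) != "paid":
            continue
        # Division: numeric casts happen only for rows that survived the filter.
        ratio = cast(row["earnedPoints"], float) / cast(row["possiblePoints"], float)
        yield row["name"], ratio

results = list(lazy_pipeline(rows))
```

Casting every typed field at load time would cost 9 casts for these rows; the lazy version performs 7, because the filtered-out row never has its numeric fields converted.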

Pig Latin-Operators

• LOAD : LOAD 'data' [USING function] [AS schema];
where, 'data' : name of file or directory
USING, AS : keywords
function : load function
schema : the loader produces data of the type specified by the schema. If the data does not conform to the schema, an error is generated.
ex: LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);

• STORE : stores results to the file system
– STORE alias INTO 'directory' [USING function];
where, alias : name of relation
INTO, USING : keywords
'directory' : name of the storage directory. If the directory already exists, the operation fails.
function : store function
ex: STORE result INTO 'myOutput';
STORE query_revenues INTO 'myoutput' USING myStore();

FOREACH
• Generates data transformations based on columns of data.
• Eg: X = FOREACH A GENERATE a1, a2;

expanded_queries = FOREACH queries GENERATE userId,
expandQuery(queryString);
-----------------
expanded_queries = FOREACH queries GENERATE userId,
FLATTEN(expandQuery(queryString));
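
The difference between the two FOREACH statements above can be sketched in Python. The function expand_query is a hypothetical stand-in for the expandQuery UDF, and the sample queries are assumptions.

```python
# Sketch of FOREACH ... GENERATE with and without FLATTEN.
# expand_query is a hypothetical stand-in for the expandQuery UDF on the slide.

def expand_query(query_string):
    """Hypothetical UDF: return a bag (list) of one-field tuples."""
    return [(query_string,), (query_string + " pictures",)]

queries = [("alice", "lakers"), ("bob", "ipod")]

# Without FLATTEN: one output tuple per input tuple; the bag stays nested.
nested = [(user_id, expand_query(q)) for user_id, q in queries]

# With FLATTEN: the bag is un-nested, giving one output tuple per bag element.
flattened = [(user_id, *t) for user_id, q in queries for t in expand_query(q)]
```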

GROUP / COGROUP
• Groups the data in one or more relations.
• GROUP used for 1 relation
• COGROUP used for 1 to 127 relations

JOIN (inner)
• Performs inner join of 2 or more relations based on common field values.

Eg: If A contains – { (1,2,3), (4,2,1) }; If B contains – {(1,3),(4,6),(4,9)}
X = JOIN A BY a1, B BY b1;
(1,2,3,1,3)
(4,2,1,4,6)
(4,2,1,4,9)
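
A minimal Python sketch of the inner join above, using the same A and B. It uses a simple in-memory hash lookup and is not meant to reflect how Pig executes the join.

```python
from collections import defaultdict

def inner_join(left, right, left_key, right_key):
    """Inner equi-join: index the right side by key, then probe with the left."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    # Concatenate each left tuple with every right tuple that shares its key.
    return [l + r for l in left for r in index.get(l[left_key], [])]

A = [(1, 2, 3), (4, 2, 1)]
B = [(1, 3), (4, 6), (4, 9)]
X = inner_join(A, B, left_key=0, right_key=0)  # JOIN A BY a1, B BY b1
```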

ORDER BY
• Sorts relation based on 1 or more fields

Eg: X = ORDER A BY a3 DESC;
(1,2,3)
(4,2,1)

Pig Latin: Key Features
• A step-by-step dataflow language where computation steps are chained together through the use of variables
• The use of high-level transformations, e.g., GROUP, FILTER
• The ability to specify schemas as part of issuing a program
• The use of user-defined functions (e.g., top10)

Pig allows three modes of user interaction:
• Interactive mode: the user is presented with an interactive shell (called Grunt), which accepts Pig commands.
• Batch mode: a user submits a prewritten script containing a series of Pig commands.
• Embedded mode: Pig is also provided as a Java library, allowing Pig Latin commands to be submitted via method invocations from a Java program.

Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
– Logical to Physical compilation
– Physical to Map-Reduce compilation
– Branching plans
• Map-Reduce Optimizer
• Hadoop Job Manager

Parser
• Verifies that the program is syntactically correct and that all referenced variables are defined
• Type checking
• Schema inference
• Verifies the ability to instantiate classes corresponding to UDFs
• Confirms the existence of streaming executables

– Output of parser: logical plan
• One-to-one correspondence between Pig Latin statements and logical operators
• Arranged in a directed acyclic graph (DAG)

Logical Optimizer
• Logical optimizations such as projection pushdown are carried out
Map-Reduce Compiler: Logical to Physical Compilation (1)

Map-Reduce Compiler: LOGICAL PLAN STRUCTURE => PHYSICAL PLAN => MAP-REDUCE PLAN

Map-Reduce Compiler: compiles the logical plan into a series of Map-Reduce jobs

The (CO)GROUP operator becomes a series of 3 physical operators:
• Local rearrange and global rearrange operators: bring tuples with the same key onto the same machine and make them adjacent in the data stream; rearranging is done by hashing or sorting by key
• Package operator: places adjacent same-key tuples into a single-tuple package

The JOIN operator is handled in 2 ways:
• Rewritten into COGROUP followed by a FOREACH operator that performs "flattening", which yields a parallel hash join or sort-merge join
• Fragment-replicate join, which executes entirely in the map stage or entirely in the reduce stage
Example for (CO)GROUP Conversion:
• (1,R),(2,G) in stream A
• (1,B), (2,Y) in stream B

• Local Rearrange Operator :
– Eg: Converts tuple (1,R) to {1,(1,R)}

• Global Rearrange operator: Sort
– Eg: Reducer 1 : {1,{(1,R),(1,B)}}
Reducer 2: {2,{(2,G),(2,Y)}}

• Package Operator:
– Places same-key tuples into single-tuple
package
– Eg: Reducer 1: {1,{(1,R)},{(1,B)}}
Reducer 2: {2,{(2,G)},{(2,Y)}}
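
The three steps of the (CO)GROUP conversion can be sketched in Python on the same streams A and B. This is a single-process illustration under the assumption that a plain sort stands in for the shuffle; the function names are chosen here and are not Pig's class names.

```python
from itertools import groupby

def local_rearrange(stream_id, tuples, key_index=0):
    """Annotate each tuple with its key and the identifier of its input stream."""
    return [(t[key_index], stream_id, t) for t in tuples]

def global_rearrange(annotated):
    """Bring same-key tuples together; a sort stands in for the shuffle here."""
    return sorted(annotated, key=lambda kst: (kst[0], kst[1]))

def package(rearranged, num_streams):
    """Turn each run of adjacent same-key tuples into (key, bag_0, ..., bag_n)."""
    out = []
    for key, run in groupby(rearranged, key=lambda kst: kst[0]):
        bags = [[] for _ in range(num_streams)]
        for _, stream_id, t in run:
            bags[stream_id].append(t)
        out.append((key, *bags))
    return out

A = [(1, "R"), (2, "G")]
B = [(1, "B"), (2, "Y")]
annotated = local_rearrange(0, A) + local_rearrange(1, B)
cogrouped = package(global_rearrange(annotated), num_streams=2)
```

Each output tuple holds the key followed by one bag per input stream, which matches the shape of the package operator's output in the example.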

3 types of Join operators
Fragment-replicate join
• Joins a huge table with a very small table
• Huge table is fragmented and distributed to mappers (or reducers)
• Small table is replicated to each machine
• Runs in either the map or the reduce stage
Parallel hash join
• Map stage: hashes tables by join key
• Reduce stage: joins fragments of tables
– Data with the same hash value is assigned to one reducer
Sort-merge join
• Both inputs are sorted on the join key
• Each node gets a fragment of each sorted table; the same keys go to the same node
• Each node performs the join; a map step alone is sufficient
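
A minimal Python sketch of the fragment-replicate join described above. The table names and rows are assumptions, and the mappers are simulated by a loop over fragments.

```python
def fragment_replicate_join(big_fragments, small_table, big_key, small_key):
    """Join each fragment of the big table against a full copy of the small table."""
    # Built once here; in a distributed run each mapper would hold its own copy.
    lookup = {}
    for s in small_table:
        lookup.setdefault(s[small_key], []).append(s)
    joined_per_mapper = []
    for fragment in big_fragments:  # one iteration per simulated mapper
        joined = [b + s for b in fragment for s in lookup.get(b[big_key], [])]
        joined_per_mapper.append(joined)
    return joined_per_mapper

# Hypothetical data: a large click table in two fragments and a small page table.
clicks_fragments = [
    [("u1", "p1"), ("u2", "p2")],
    [("u3", "p1")],
]
pages = [("p1", "news"), ("p2", "sports")]
out = fragment_replicate_join(clicks_fragments, pages, big_key=1, small_key=0)
```

No tuple of the big table ever has to move to another machine, which is why this join needs no shuffle.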

Map-Reduce Compiler: Physical to Map-Reduce Compilation (1)

Physical to Map-Reduce Compilation:

• Physical operators are assigned to Hadoop stages so as to minimize the number of reduce stages

• Local rearrange operator: simply annotates tuples with keys and stream identifiers, and lets the Hadoop local sort stage do the work

• Global rearrange operators are removed; they are implemented by the Hadoop shuffle and merge stages

• Load and store operators are removed; the Hadoop framework reads and writes the data
Map-Reduce Compiler: Branching Plans (1)
Branching Plans
• More than one STORE command: one for each branch of the split
• Data is read once and processed in multiple ways
• Risk of data spilling to disk
• SPLIT operator: feeds a copy of the input to each nested sub-plan
Example 1: Logical SPLIT command, which splits a table
• Map-only plan
clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
SPLIT clicks INTO
pages IF pageid IS NOT NULL, -- corresponds to the FILTER of the 1st sub-plan
links IF linkid IS NOT NULL; -- corresponds to the FILTER of the 2nd sub-plan
-- 1st sub-plan:
cpages = FOREACH pages GENERATE userid, CanonicalizePage(pageid) AS page, viewedat;
STORE cpages INTO 'pages';
-- 2nd sub-plan:
clinks = FOREACH links GENERATE userid, CanonicalizeLink(linkid) AS clink, viewedat;
STORE clinks INTO 'links';
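
The "read once, process in multiple ways" behaviour of Example 1 can be sketched in Python. The canonicalize function is a hypothetical stand-in for the CanonicalizePage and CanonicalizeLink UDFs, and the click rows are assumptions.

```python
def canonicalize(value):
    """Hypothetical stand-in for the CanonicalizePage / CanonicalizeLink UDFs."""
    return value.strip().lower()

def run_split(clicks):
    """Read the input once and feed each tuple to every sub-plan whose condition holds."""
    pages_out, links_out = [], []
    for userid, pageid, linkid, viewedat in clicks:  # single pass over the data
        if pageid is not None:  # FILTER of the 1st sub-plan
            pages_out.append((userid, canonicalize(pageid), viewedat))
        if linkid is not None:  # FILTER of the 2nd sub-plan
            links_out.append((userid, canonicalize(linkid), viewedat))
    return pages_out, links_out

clicks = [
    ("u1", " Home ", None, 10),
    ("u2", None, "Ad-7", 11),
    ("u3", "News", "Ad-9", 12),
]
pages, links = run_split(clicks)
```

A tuple that satisfies both conditions, like the third one here, is delivered to both sub-plans.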

Map-Reduce Compiler: Branching Plans (2)
Example 2:
• The split propagates across the map/reduce boundary
• No logical SPLIT operator
• The compiler inserts a physical SPLIT operator
• MULTIPLEX operator: routes tuples to the correct sub-plan; used in the reduce stage only

goodclicks = FILTER clicks BY viewedat IS NOT NULL;
-- 1st sub-plan: grouped by pageid
bypage = GROUP goodclicks BY pageid;
cntbypage = FOREACH bypage GENERATE group, COUNT(goodclicks);
STORE cntbypage INTO 'bypage';
-- 2nd sub-plan: grouped by linkid
bylink = GROUP goodclicks BY linkid;
cntbylink = FOREACH bylink GENERATE group, COUNT(goodclicks);
STORE cntbylink INTO 'bylink';

Map-Reduce Optimizer

Performs early partial aggregation for distributive or algebraic aggregation functions

e.g. for the function AVERAGE, the steps are:
a) Initial
• e.g. generate (sum, count) pairs
• Assigned to the map stage
b) Intermediate
• e.g. combine n (sum, count) pairs into a single pair
• Assigned to the combine stage
c) Final
• e.g. combine n (sum, count) pairs and take the quotient
• Assigned to the reduce stage

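The three steps for AVERAGE can be sketched in Python. The function names and the way the input is split across map tasks are assumptions made for illustration.

```python
def initial(values):
    """Map stage: turn raw values into one (sum, count) pair."""
    return (sum(values), len(values))

def intermediate(pairs):
    """Combine stage: merge n (sum, count) pairs into a single pair."""
    return (sum(s for s, _ in pairs), sum(c for _, c in pairs))

def final(pairs):
    """Reduce stage: merge the remaining pairs and take the quotient."""
    total, count = intermediate(pairs)
    return total / count

# Three simulated map tasks, two simulated combiners, one reducer.
map_outputs = [initial([1, 2, 3]), initial([4, 5]), initial([6])]
combined = [intermediate(map_outputs[:2]), intermediate(map_outputs[2:])]
average = final(combined)
```

Because only (sum, count) pairs cross the network, the amount of shuffled data no longer depends on how many raw values each map task saw.
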
Hadoop Job Manager

• Map-Reduce jobs are topologically sorted and submitted to Hadoop for execution
• A Java jar file is generated containing the Map and Reduce implementation classes and the UDFs
• The Map and Reduce classes contain general-purpose dataflow execution engines
• Monitors execution and generates periodic status reports
• Warnings and errors are logged and reported

Plan Execution
• Flow Control
– Nested Programs
• Memory Management
Streaming
• Flow Control
PLAN EXECUTION - FLOW CONTROL
• Pig executes the Map or Reduce stage of the physical plan
• Assume that data flows downward in an execution plan

• To control the movement of tuples through the execution pipeline, 2 models are available: the push model and the pull (iterator) model
1) Push Model:
Eg: Operator A pushes data to B, which operates on it and pushes the result to C.
(A, B and C are physical operators)

Difficult to implement for:
• UDFs with multiple inputs
• Binary operators like fragment-replicate join

2) Pull Model:

Eg: Operator C asks B for its next data item.
If B has nothing pending to return, it asks A.
When A returns a data item, B operates on it and returns the result to C.

Advantages:
• Single-threaded implementation: avoids context-switching overhead
• Simple APIs for UDFs
Drawbacks:
• Operations over a bag nested inside a tuple may lead to memory overflow
• If the dataflow graph has multiple sinks, operators at branch points may be required to buffer an unbounded number of tuples

PLAN EXECUTION - FLOW CONTROL (2)

Solution:
When asked to produce a tuple, an operator responds in one of three ways:
a) Returns a tuple;
b) Declares itself finished; or
c) Returns a pause signal to indicate that it is not finished but cannot currently produce an output tuple
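
The three possible responses can be sketched as a small pull-model pipeline in Python. The class names and the choice to have the filter answer with a pause for each rejected tuple are assumptions made to show the signal; they are not taken from Pig's source.

```python
from enum import Enum

class Status(Enum):
    TUPLE = 1     # a) a tuple is returned
    FINISHED = 2  # b) the operator has no more output
    PAUSE = 3     # c) not finished, but no output tuple available right now

_END = object()

class Source:
    """Leaf operator that hands out its items one at a time."""
    def __init__(self, items):
        self._it = iter(items)

    def get_next(self):
        item = next(self._it, _END)
        if item is _END:
            return Status.FINISHED, None
        return Status.TUPLE, item

class Filter:
    """Pulls from its parent; answers PAUSE when the pulled tuple is rejected."""
    def __init__(self, parent, predicate):
        self.parent, self.predicate = parent, predicate

    def get_next(self):
        status, item = self.parent.get_next()
        if status is Status.TUPLE and not self.predicate(item):
            return Status.PAUSE, None
        return status, item

def drain(op):
    """Consumer at the bottom of the pipeline: keeps pulling until FINISHED."""
    out, pauses = [], 0
    while True:
        status, item = op.get_next()
        if status is Status.FINISHED:
            return out, pauses
        if status is Status.PAUSE:
            pauses += 1
            continue
        out.append(item)

result, pauses = drain(Filter(Source([1, 2, 3, 4, 5]), lambda x: x % 2 == 1))
```

The pause response returns control to the caller instead of blocking inside one operator, so the caller stays free to service other branches of the plan.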

NESTED PROGRAMS:
• Pig operators can be invoked over bags nested within tuples
• For example, to compute the number of distinct pages and links visited by each user:
(Alice,Page1,Link1,Site1)
(John,Page1,Link2,Site2)
(John,Page2,Link2,Site3)
byuser = GROUP clicks BY userid;
(Alice, {(Alice,Page1,Link1,Site1)})
(John, {(John,Page1,Link2,Site2), (John,Page2,Link2,Site3)})
result = FOREACH byuser
{
uniqPages = DISTINCT clicks.pageid;
uniqLinks = DISTINCT clicks.linkid;
GENERATE group, COUNT(uniqPages), COUNT(uniqLinks);
};
(Alice, 1, 1)
(John, 2, 1)


• The outer operator graph contains a FOREACH operator
• It contains a nested operator graph of 2 pipelines
• Each pipeline contains DISTINCT and COUNT operators
• FOREACH requests a tuple T from the PACKAGE operator
• It places a cursor on the bag of click tuples for the 1st DISTINCT-COUNT pipeline
• It requests a tuple from the bottom of the pipeline (the COUNT operator)
• The process is repeated for the second pipeline
• The FOREACH operator constructs and returns the output tuple
PLAN EXECUTION - FLOW CONTROL
• When the nested plan is a single branching pipeline:
(Alice,Page1,Link1,Site1)
(John,Page1,Link2,Site2)
(John,Page2,Link2,NULL)
byuser = GROUP clicks BY userid;
(Alice, {(Alice,Page1,Link1,Site1)})
(John, {(John,Page1,Link2,Site2), (John,Page2,Link2,NULL)})
result = FOREACH byuser
{
fltrd = FILTER clicks BY viewedat IS NOT NULL;
uniqPages = DISTINCT fltrd.pageid;
uniqLinks = DISTINCT fltrd.linkid;
GENERATE group, COUNT(uniqPages), COUNT(uniqLinks);
};
(Alice, 1, 1)
(John, 1, 1)

A more complex situation arises when the nested plan is not two independent pipelines but rather a single branching pipeline.
Solution:
• Pig currently handles this case by duplicating the FILTER operator and producing two independent pipelines, to be executed as explained above.

PLAN EXECUTION - Memory Management

• Like Hadoop, Pig is implemented in Java.
• Java memory management causes problems during query processing
– Java does not allow the developer to control memory allocation and deallocation directly
• Naive option: increase the JVM memory size limit beyond the physical memory size, and let the virtual memory manager take care of staging data between memory and disk.
– Problem: performance degradation

• Better to return an "out-of-memory" error
– The administrator can adjust the memory management parameters and re-submit the program

PLAN EXECUTION - Memory Management

• Memory overflow is mostly due to large bags of tuples

• Java's MemoryPoolMXBean interface is used to get notified of a low-memory situation. When notified, Pig spills excess bags to disk.

• Pig estimates bag sizes by sampling a few tuples

• The memory manager maintains a list of the Pig bags created in the same JVM, using a linked list of Java WeakReferences

• WeakReferences ensure that bags no longer in use can still be garbage-collected
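
The weak-reference bookkeeping can be sketched in Python, with the weakref module standing in for Java's WeakReference. The Bag and MemoryManager classes are illustrative assumptions, the low-memory notification is triggered by hand, and the sketch spills every live bag rather than selecting by estimated size.

```python
import gc
import weakref

class Bag:
    """A toy bag that can move its tuples out of memory."""
    def __init__(self, tuples):
        self.tuples = list(tuples)
        self.spilled = False

    def spill(self):
        # A real implementation would write the tuples to disk; here they are only dropped.
        self.tuples = []
        self.spilled = True

class MemoryManager:
    def __init__(self):
        self._bags = []  # holds weak references only, so it never keeps a bag alive

    def register(self, bag):
        self._bags.append(weakref.ref(bag))

    def on_low_memory(self):
        """Spill every bag that is still alive and forget the dead ones."""
        alive = []
        for ref in self._bags:
            bag = ref()
            if bag is not None:
                bag.spill()
                alive.append(ref)
        self._bags = alive
        return len(alive)

manager = MemoryManager()
kept = Bag([(1,), (2,)])
dropped = Bag([(3,)])
manager.register(kept)
manager.register(dropped)
del dropped   # no strong reference is left, so this bag can be collected
gc.collect()  # makes the collection explicit on runtimes without reference counting
spilled_count = manager.on_low_memory()
```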

STREAMING – FLOW CONTROL
• Pig allows user-defined functions (UDFs)
– UDFs must be written in Java and must conform to Pig's UDF interface
– UDFs behave synchronously

Streaming:
• Allows data to be pushed through external executables
– Users are able to intermix relational operations like grouping and filtering with custom or legacy executables.
• A streaming executable behaves asynchronously.

Challenges in implementing streaming in Pig:
• Fitting it into the iterator model of Pig's execution pipeline
– Because of the asynchronous behavior of the user's executable, the STREAM operator that wraps the executable cannot simply pull tuples synchronously as it does with other operators: it does not know what state the executable is in.
– There may be no output because:
• the executable is waiting to receive more input: the STREAM operator needs to push new data
• the executable is still busy processing prior inputs: the STREAM operator should wait

• With a single-threaded operator execution model, a deadlock can occur
– The Pig operator is waiting for the external executable to consume a new input tuple, while at the same time the executable is waiting for its output to be consumed

Solution:
STREAM operator:
• Creates 2 additional threads: one to feed data to the executable and the other to consume data from it
• Blocks until a tuple is available on the executable's output queue or until the executable terminates
• If space is available in the input queue, places a tuple from the parent operator into it
Performance
• In the initial implementation of Pig, functionality and proof of concept were considered more important than performance
• As Pig was adopted within Yahoo!, better performance quickly became a priority

• Pig Mix: a publicly available benchmark used to measure performance on a regular basis, so that the effects of individual code changes on performance can be understood

Benchmark Results
Pig Mix benchmark
• September 11, 2008:
– Initial Apache open-source release
• November 11, 2008:
– Enhanced type system
– Rewrote execution pipeline
– Combiner enhanced
• January 20, 2009:
– Buffering during data parsing
– Fragment-replicate join algorithm
• February 23, 2009:
– Rework of partitioning function used in ORDER BY to ensure more balanced
distribution of keys to reducers
• April 20, 2009:
– Branching execution plans
• Vertical axis: ratio of the total running time for 12 Pig programs to that of the corresponding Map-Reduce programs
• Current performance ratio is 1.5: a reasonable trade-off point between execution time and code development/maintenance effort

Pros & Cons
• The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single-block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.
• With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially.

Pros & Cons

Pros:
• Explicit dataflow
• Retains properties of Map-Reduce
• Scalability
• Fault tolerance
• Multi-way processing
• Open source

Cons:
• Column-wise storage structures are missing
• Memory management
• No facilitation for non-Java users
• Limited optimization
• No GUI for flow graphs

Future Work
• Query optimization
– Currently rule-based optimizer for plan rearrangement and join selection
– Cost-based in the future
• Non-Java UDFs
• SQL interface
• Grouping and joining of pre-partitioned/sorted data.
– Avoid data shuffling for grouping and joining
– Building metadata facilities to keep track of data layout
• Skew handling.
– For load balancing

Summary
• Big demand for parallel data processing
– Programmers like dataflow pipes over static files
• Ease of programming.
• UDFs: users can create their own functions to do special-purpose processing.
• Optimization opportunities: the way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
• Open source

Pig Latin : Sweet spot between map-reduce and SQL

Related Work
• Sawzall
– Data processing language on top of map-reduce
– Rigid structure of filtering followed by aggregation
• Hive
– SQL-like language on top of Map-Reduce
• DryadLINQ
– SQL-like language on top of Dryad
