This document describes Pig, a high-level dataflow system built on top of MapReduce. Pig provides a language called Pig Latin for analyzing large datasets, and Pig Latin programs are compiled into MapReduce jobs. Compilation involves several steps: (1) parsing and type checking the Pig Latin code, (2) logical optimization, (3) converting the logical plan into physical operators (for example, the rearrange and package operators that implement GROUP and JOIN), (4) mapping the physical operators to MapReduce stages, and (5) optimizing the MapReduce plan. This lets users write data analysis programs at a higher level without coding MapReduce jobs directly.
Pig Experience
1. Building a High-Level Dataflow System on Top of MapReduce: The Pig Experience
Tilani Gunawardena
2. Content
• Introduction
• Background
• System Overview
• Type System & Type Inference
• Compilation To Map-Reduce
• Plan Execution
• Streaming
• Performance
• Adoption
• Project Experience
• Future Work
3. Introduction
• Internet companies are swimming in data
• TBs/day for Yahoo! or Google
• PBs/day for Facebook
• Data
– Unstructured elements
• Web page text, images
– Structured elements
• Web page click records, extracted entity-relationship models
• Processing
– Filter, join, count
• Data Warehousing??
– Scale: often not scalable enough
– Price: prohibitively expensive at web scale
– SQL:
• High-level declarative approach
• Little control over execution method
• The Map-Reduce Appeal??
– Scale: scalable due to simpler design, explicit programming model
– Price: runs on cheap commodity hardware
– SQL
4. MapReduce Disadvantages
• Does not directly support complex N-step dataflows
• Lacks explicit support for combined processing of multiple data sets
– Joins and other data-matching operations
• Frequently needed data manipulation primitives must be coded by hand
– Filtering, aggregation, join, projection, sorting
5. Pig
• Pig's language, Pig Latin, chooses a spot between the MapReduce framework and SQL
• Defines a new language to allow better control in large-scale data processing
• Spares database programmers from writing map and reduce code, which is at too low a level
6. Pig Latin: Data Types
• Rich and simple data model
Simple types:
int, long, double, chararray (string), bytearray
Complex types:
• Map: an associative array; key: chararray; value: any type
• Tuple: collection of fields, e.g. ('apple', 'mango')
• Bag: collection of tuples
{ ('apple', 'mango')
('apple', ('red', 'yellow'))
}
7. Pig Latin: Input/Output Data
Input:
queries = LOAD 'data.txt'
USING BinStorage
AS (url, category, pagerank);
Output:
STORE result INTO 'myoutput';
BinStorage: binary storage function in Pig
9. Pig Latin: General Syntax
• Discarding unwanted data: FILTER
• Comparison operators such as ==, eq, !=, neq
• Logical connectors AND, OR, NOT
10. Pig Latin: Type Declaration
• Pig supports three options for declaring the data types of fields:
– No data types are declared: the default is to treat all fields as bytearray.
Ex: a = LOAD 'data' USING BinStorage AS (user);
– Types are provided explicitly as part of the AS clause during the LOAD:
Ex: a = LOAD 'data' USING BinStorage AS (user:chararray);
– The load function itself provides the schema information, which accommodates self-describing data formats such as JSON
11. Pig Latin: Lazy Conversion of Types
• When Pig needs to cast a bytearray to another type because the program applies a type-specific operator, it delays that cast to the point where it is actually necessary.
• status will need to be cast to a chararray
• earnedPoints and possiblePoints will need to be cast to double
• These casts will not be done when the data is loaded
• They will be done as part of the comparison and division operations
• This avoids casting values that are removed by the filter before the result of the cast is used.
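The effect of lazy conversion can be sketched in Python. This is only an illustration, not Pig's implementation: the field names follow the slide (status, earnedPoints, possiblePoints), while the function name, the sample rows and the status value "active" are assumptions made up for the example.

```python
def filter_then_divide(rows, wanted_status):
    """Filter on status, then divide earned by possible points.

    Every field arrives as raw bytes (Pig's bytearray). Casts happen only
    at the operator that needs them, so the numeric fields of rows that
    the filter drops are never cast.
    """
    numeric_casts = 0
    ratios = []
    for status, earned_points, possible_points in rows:
        # The comparison needs a string, so status is cast here, not at load time.
        if status.decode("utf-8") != wanted_status:
            continue  # numeric fields of this row stay uncast
        # The division needs numbers, so these two casts happen here.
        earned = float(earned_points.decode("ascii"))
        possible = float(possible_points.decode("ascii"))
        numeric_casts += 2
        ratios.append(earned / possible)
    return ratios, numeric_casts


# Hypothetical sample data: (status, earnedPoints, possiblePoints) as bytes.
rows = [
    (b"active", b"45", b"50"),
    (b"inactive", b"10", b"40"),
    (b"active", b"30", b"60"),
]
ratios, numeric_casts = filter_then_divide(rows, "active")
print(ratios, numeric_casts)
```

Casting eagerly at load time would have performed six numeric casts for these three rows; the lazy version performs four.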
12. Pig Latin-Operators
• LOAD: LOAD 'data' [USING function] [AS schema];
where 'data': name of file or directory
USING, AS: keywords
function: load function
schema: the loader produces data of the type specified by the schema. If the data does not conform to the schema, an error is generated.
ex: LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
• STORE: stores results to the file system
– STORE alias INTO 'directory' [USING function];
where alias: name of relation
INTO, USING: keywords
'directory': name of the storage directory. If the directory already exists, the operation fails.
function: store function
ex: STORE result INTO 'myOutput';
STORE query_revenues INTO 'myoutput' USING myStore();
13. FOREACH
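• Generates data transformations based on columns of data.
• E.g.: X = FOREACH A GENERATE a1, a2;
• Without FLATTEN, the bag returned by the UDF stays nested inside each output tuple:
expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);
-----------------
• With FLATTEN, each element of the bag yields its own output tuple:
expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));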
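The difference between the two FOREACH statements can be mimicked in Python. This is a rough sketch of the semantics only: expand_query is a hypothetical stand-in for the expandQuery UDF, and the sample queries are invented.

```python
def expand_query(query_string):
    """Hypothetical stand-in for expandQuery: returns a bag of 1-field tuples."""
    return [(query_string,), (query_string + " online",)]


def foreach_generate(queries, flatten):
    """Emulate FOREACH queries GENERATE userId, [FLATTEN(]expandQuery(queryString)[)]."""
    output = []
    for user_id, query_string in queries:
        bag = expand_query(query_string)
        if flatten:
            # FLATTEN: one output tuple per element of the bag.
            for expansion in bag:
                output.append((user_id,) + expansion)
        else:
            # No FLATTEN: the whole bag is kept as a single nested field.
            output.append((user_id, bag))
    return output


queries = [("alice", "lakers"), ("bob", "iPod")]
nested = foreach_generate(queries, flatten=False)
flat = foreach_generate(queries, flatten=True)
print(nested)
print(flat)
```

The nested form has one tuple per input query; the flattened form has one tuple per expansion.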
14. GROUP / COGROUP
• Groups the data in one or more relations.
• GROUP used for 1 relation
• COGROUP used for 2 or more relations (up to 127)
15. JOIN (inner)
• Performs an inner join of 2 or more relations based on common field values.
E.g.: if A contains {(1,2,3), (4,2,1)} and B contains {(1,3), (4,6), (4,9)}
X = JOIN A BY a1, B BY b1;
(1,2,3,1,3)
(4,2,1,4,6)
(4,2,1,4,9)
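The join result above can be reproduced with a small hash join in Python. This is a sketch of inner-join semantics on the first field of each relation, not of how Pig executes JOIN; the function name and parameters are assumptions. Pig itself does not promise any particular output order.

```python
from collections import defaultdict


def inner_join(a, b, a_key=0, b_key=0):
    """Inner join of two lists of tuples on the given field positions."""
    # Build a lookup from join key to the matching rows of b.
    index = defaultdict(list)
    for row in b:
        index[row[b_key]].append(row)
    # Each row of a is concatenated with every matching row of b.
    return [ra + rb for ra in a for rb in index.get(ra[a_key], [])]


A = [(1, 2, 3), (4, 2, 1)]
B = [(1, 3), (4, 6), (4, 9)]
X = inner_join(A, B)
print(X)
```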
ORDER BY
• Sorts a relation based on 1 or more fields
E.g.: X = ORDER A BY a3 DESC;
(1,2,3)
(4,2,1)
17. • A step-by-step dataflow language where computation steps are chained together through the use of variables
• The use of high-level transformations, e.g., GROUP, FILTER
• The ability to specify schemas as part of issuing a program
• The use of user-defined functions (e.g., top10)
18. Pig allows three modes of user interaction:
• Interactive mode: the user is presented with an interactive shell (called Grunt), which accepts Pig commands.
• Batch mode: a user submits a prewritten script containing a series of Pig commands.
• Embedded mode: Pig is also provided as a Java library, allowing Pig Latin commands to be submitted via method invocations from a Java program.
19. Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
–Logical to Physical compilation
–Physical to Map-Reduce Compilation
–Branching Plans
• Map-Reduce Optimizer
• Hadoop Job Manager
20. Parser
• Verifies that the program is syntactically correct and that all referenced variables are defined
• Type checking
• Schema inference
• Verifies the ability to instantiate classes corresponding to UDFs
• Confirms the existence of streaming executables
– Output of the parser: logical plan
• One-to-one correspondence between Pig Latin statements and logical operators
• Arranged in a directed acyclic graph (DAG)
Logical Optimizer
• Logical optimizations, such as projection pushdown, are carried out
21. Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
–Logical to Physical compilation
–Physical to Map-Reduce Compilation
–Branching Plans
• Map-Reduce Optimizer
• Hadoop Job Manager
22. Map-Reduce Compiler:Logical to Physical compilation(1)
Map-Reduce Compiler:LOGICAL PLAN STRUCTURE => PHYSICAL PLAN => MAP-REDUCE PLAN
23. Map-Reduce Compiler:Logical to Physical compilation(2)
Map-Reduce Compiler - compiles Logical Plan into series of Map-Reduce jobs
The (CO)GROUP operator becomes a series of 3 physical operators:
• Local rearrange and global rearrange operators: together they bring tuples with the same key onto the same machine and make them adjacent in the data stream
– "Rearrange" means hashing or sorting by key
• Package operator: places adjacent same-key tuples into a single-tuple package
The JOIN operator is handled in 2 ways:
• Rewritten into a COGROUP followed by a FOREACH operator that performs "flattening", which yields a parallel hash join or a sort-merge join
• Fragment-replicate join, which executes entirely in the map stage or entirely in the reduce stage
24. Map-Reduce Compiler:Logical to Physical compilation(3)
Example for (CO)GROUP Conversion:
• (1,R),(2,G) in stream A
• (1,B), (2,Y) in stream B
• Local rearrange operator:
– E.g.: converts tuple (1,R) to {1,(1,R)}
• Global rearrange operator: sort
– E.g.: Reducer 1: {1,{(1,R),(1,B)}}
Reducer 2: {2,{(2,G),(2,Y)}}
• Package operator:
– Places same-key tuples into a single-tuple package
– E.g.: Reducer 1: {1,{(1,R)},{(1,B)}}
Reducer 2: {2,{(2,G)},{(2,Y)}}
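The three physical operators can be walked through in Python on the same two streams. This is a simplified single-process sketch under stated assumptions: the function names, the stream identifiers 0 and 1, and the use of hash(key) modulo the reducer count as the partitioning rule are all illustrative choices, not Pig's actual code.

```python
def local_rearrange(stream_id, tuples, key_index=0):
    """Annotate every tuple with its key and the stream it came from."""
    return [(t[key_index], stream_id, t) for t in tuples]


def global_rearrange(annotated, num_reducers):
    """Send each key to one reducer and sort so that equal keys are adjacent."""
    reducers = [[] for _ in range(num_reducers)]
    for key, stream_id, t in annotated:
        reducers[hash(key) % num_reducers].append((key, stream_id, t))
    return [sorted(r, key=lambda item: (item[0], item[1])) for r in reducers]


def package(reducer_input, num_streams):
    """Collect adjacent same-key tuples into one package with a bag per stream."""
    packages = {}
    for key, stream_id, t in reducer_input:
        bags = packages.setdefault(key, [[] for _ in range(num_streams)])
        bags[stream_id].append(t)
    return [(key, *bags) for key, bags in packages.items()]


A = [(1, "R"), (2, "G")]  # stream 0
B = [(1, "B"), (2, "Y")]  # stream 1
annotated = local_rearrange(0, A) + local_rearrange(1, B)
cogrouped = []
for reducer_input in global_rearrange(annotated, num_reducers=2):
    cogrouped.extend(package(reducer_input, num_streams=2))
cogrouped.sort(key=lambda p: p[0])
print(cogrouped)
```

Each package holds the key followed by one bag per input stream, which matches the {1,{(1,R)},{(1,B)}} shape on the slide.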
25. Map-Reduce Compiler:Logical to Physical compilation(4)
3 types of join operators
Fragment-replicate join
• Joins a huge table with a very small table
• The huge table is fragmented and distributed to the mappers (or reducers)
• The small table is replicated to each machine
• Runs either in the map or in the reduce stage
Parallel hash join
• Map stage: hashes the tables by join key
• Reduce stage: joins fragments of the tables
– Data with the same hash value is assigned to 1 reducer
Sort-merge join
• Both inputs are sorted on the join key
• Each node gets a fragment of the sorted tables; the same keys go to the same node
• Each node performs the join; a map-only step is sufficient
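A fragment-replicate join can be sketched in Python as follows. The sketch assumes the small table fits in memory, models each mapper as one list of rows, and uses invented sample data; the function and parameter names are hypothetical.

```python
def fragment_replicate_join(huge_fragments, small_table, huge_key=0, small_key=0):
    """Join each fragment of the huge table against a full copy of the small table."""
    joined = []
    for fragment in huge_fragments:  # one fragment per (simulated) mapper
        # Every mapper builds its own in-memory lookup from the replicated small table.
        lookup = {}
        for row in small_table:
            lookup.setdefault(row[small_key], []).append(row)
        # The fragment is streamed once; no shuffle or reduce stage is needed.
        for row in fragment:
            for match in lookup.get(row[huge_key], []):
                joined.append(row + match)
    return joined


huge_fragments = [
    [(1, "click-a"), (2, "click-b")],  # fragment handled by mapper 1
    [(1, "click-c"), (3, "click-d")],  # fragment handled by mapper 2
]
small_table = [(1, "US"), (2, "DE")]
result = fragment_replicate_join(huge_fragments, small_table)
print(result)
```

Because the lookup is complete on every mapper, rows of the huge table never have to move between machines, which is why this join can finish within a single stage.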
26. Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
–Logical to Physical compilation
–Physical to Map-Reduce Compilation
–Branching Plans
• Map-Reduce Optimizer
• Hadoop Job Manager
27. Map-Reduce Compiler:Physical to Map-Reduce Compilation(1)
Physical to MapReduce Compilation:
• Physical operators are assigned to Hadoop stages so as to minimize the number of reduce stages
• Local rearrange operator: simply annotates tuples with keys and stream identifiers, and lets the Hadoop local sort stage do the work
• Global rearrange operators are removed; they are implemented by the Hadoop shuffle and merge stages
• Load and store operators are removed; the Hadoop framework reads and writes the data
28. Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
–Logical to Physical compilation
–Physical to Map-Reduce Compilation
–Branching Plans
• Map-Reduce Optimizer
• Hadoop Job Manager
29. Map-Reduce Compiler:Branching Plans(1)
Branching Plans
• More than 1 STORE command: one for each branch of the split
• Data is read once and processed in multiple ways
• Risk of data spilling to disk
• SPLIT operator: feeds a copy of its input to each nested sub-plan
Example 1: logical SPLIT command that splits a table
• Map-only plan
clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
SPLIT clicks INTO
pages IF pageid IS NOT NULL, -- corresponds to the FILTER of the 1st sub-plan
links IF linkid IS NOT NULL; -- corresponds to the FILTER of the 2nd sub-plan
-- 1st sub-plan:
cpages = FOREACH pages GENERATE userid, CanonicalizePage(pageid) AS page, viewedat;
STORE cpages INTO 'pages';
-- 2nd sub-plan:
clinks = FOREACH links GENERATE userid, CanonicalizeLink(linkid) AS clink, viewedat;
STORE clinks INTO 'links';
30. Map-Reduce Compiler:Branching Plans(2)
Example 2:
• The split propagates across the map/reduce boundary
• No logical SPLIT operator
• The compiler inserts a physical SPLIT operator
• MULTIPLEX operator: routes tuples to the correct sub-plan; used in the reduce stage only
clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
goodclicks = FILTER clicks BY viewedat IS NOT NULL;
-- 1st sub-plan: grouped by pageid
bypage = GROUP goodclicks BY pageid;
cntbypage = FOREACH bypage GENERATE group, COUNT(goodclicks);
STORE cntbypage INTO 'bypage';
-- 2nd sub-plan: grouped by linkid
bylink = GROUP goodclicks BY linkid;
cntbylink = FOREACH bylink GENERATE group, COUNT(goodclicks);
STORE cntbylink INTO 'bylink';
31. Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
–Logical to Physical compilation
–Physical to Map-Reduce Compilation
–Branching Plans
• Map-Reduce Optimizer
• Hadoop Job Manager
32. Map-Reduce Optimizer
Performs early partial aggregation for distributive or algebraic aggregation functions
E.g., for the function AVERAGE, the steps are:
a) Initial
e.g. generate (sum, count) pairs
Assigned to the map stage
b) Intermediate
e.g. combine n (sum, count) pairs into a single pair
Assigned to the combine stage
c) Final
e.g. combine n (sum, count) pairs and take the quotient
Assigned to the reduce stage
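The three steps for AVERAGE can be written out in Python. This is a minimal sketch in which each inner list stands for the input of one map task; the function names mirror the step names on the slide, and the sample numbers are made up.

```python
def initial(value):
    """Map stage: turn one input value into a (sum, count) pair."""
    return (value, 1)


def intermediate(pairs):
    """Combine stage: merge n (sum, count) pairs into a single pair."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return (total, count)


def final(pairs):
    """Reduce stage: merge the remaining pairs and take the quotient."""
    total, count = intermediate(pairs)
    return total / count


# Two hypothetical map tasks, each seeing part of the data.
map_inputs = [[1, 2, 3], [4, 5]]
map_outputs = [[initial(v) for v in chunk] for chunk in map_inputs]
combined = [intermediate(pairs) for pairs in map_outputs]
average = final(combined)
print(combined, average)
```

Only one (sum, count) pair per map task crosses the network, instead of one record per input value, and the result still equals the average of all five values.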
33. Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
–Logical to Physical compilation
–Physical to Map-Reduce Compilation
–Branching Plans
• Map-Reduce Optimizer
• Hadoop Job Manager
34. Hadoop Job Manager
• Map-Reduce jobs are sorted and submitted to Hadoop for execution
• A Java jar file is generated containing the Map and Reduce implementation classes and the UDFs
• The Map and Reduce classes contain general-purpose dataflow execution engines
• Monitors execution and generates periodic reports
• Warnings or errors are logged and reported
35. Plan Execution
• Flow Control
– Nested Programs
• Memory Management
Streaming
• Flow Control
36. PLAN EXECUTION - FLOW CONTROL
• Pig executes the Map or Reduce stage of a physical plan
• Assume that data flows downward in an execution plan
• 2 models are available to control the movement of tuples through the execution pipeline:
– Push and pull (iterator) models
1) Push model:
E.g.: operator A pushes data to B, which operates on it and pushes the result to C.
(A, B and C are physical operators)
Difficult to implement for:
• UDFs with multiple inputs
• Binary operators like fragment-replicate join
2) Pull model:
E.g.: operator C asks B for its next data item.
If B has nothing pending to return, it asks A.
When A returns a data item, B operates on it and returns the result to C.
Advantages:
Single-threaded implementation: avoids context-switching overhead
Simple APIs for UDFs
Drawbacks:
Operations over a bag nested inside a tuple may lead to memory overflow
If the data flow graph has multiple sinks, operators at branch points may be required to buffer an unbounded number of tuples
37. PLAN EXECUTION - FLOW CONTROL (2)
Solution:
When asked to produce a tuple, an operator can respond in one of three ways:
a) Return a tuple;
b) Declare itself finished; or
c) Return a pause signal to indicate that it is not finished but cannot produce an output tuple at this time
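The three possible responses can be sketched as a small pull-model pipeline in Python. The class names, the Status enum and the choice to make a filter answer with a pause when it drops a tuple are assumptions for illustration; they are not Pig's operator classes.

```python
from enum import Enum


class Status(Enum):
    TUPLE = 1     # a) a tuple is returned
    FINISHED = 2  # b) the operator declares itself finished
    PAUSE = 3     # c) not finished, but no output tuple right now


class Source:
    """Leaf operator that hands out rows from a list, one per request."""

    def __init__(self, rows):
        self._rows = iter(rows)

    def get_next(self):
        try:
            return Status.TUPLE, next(self._rows)
        except StopIteration:
            return Status.FINISHED, None


class Filter:
    """Pulls one row from its parent per request; pauses if the row is dropped."""

    def __init__(self, parent, predicate):
        self._parent = parent
        self._predicate = predicate

    def get_next(self):
        status, row = self._parent.get_next()
        if status is Status.TUPLE and not self._predicate(row):
            return Status.PAUSE, None
        return status, row


def drain(operator):
    """Keep pulling from the bottom operator until it declares itself finished."""
    rows, pauses = [], 0
    while True:
        status, row = operator.get_next()
        if status is Status.FINISHED:
            return rows, pauses
        if status is Status.PAUSE:
            pauses += 1  # control returns to the caller, which may do other work
            continue
        rows.append(row)


pipeline = Filter(Source([1, 2, 3, 4, 5]), lambda x: x % 2 == 1)
rows, pauses = drain(pipeline)
print(rows, pauses)
```

The pause response hands control back to the caller without ending the stream, so a caller with several pipelines can service another one instead of buffering.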
38. PLAN EXECUTION - FLOW CONTROL (3)
NESTED PROGRAMS:
• Pig operators can be invoked over bags nested within tuples
• For example, to compute the number of distinct pages and links visited by each user:
clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
(Alice,Page1,Link1,Site1)
(John,Page1,Link2,Site2)
(John,Page2,Link2,Site3)
byuser = GROUP clicks BY userid;
(Alice, {(Alice,Page1,Link1,Site1)})
(John, {(John,Page1,Link2,Site2), (John,Page2,Link2,Site3)})
result = FOREACH byuser
{
uniqPages = DISTINCT clicks.pageid;
uniqLinks = DISTINCT clicks.linkid;
GENERATE group, COUNT(uniqPages), COUNT(uniqLinks);
};
(Alice, 1, 1)
(John, 2, 1)
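The nested FOREACH can be emulated in Python to check the counts. This sketch only reproduces the semantics (group by user, then count distinct pages and links inside each bag); the function name is an assumption and the rows are the three sample clicks from the slide.

```python
from collections import defaultdict


def distinct_counts_per_user(clicks):
    """Emulate GROUP clicks BY userid followed by the nested DISTINCT and COUNT."""
    byuser = defaultdict(list)
    for click in clicks:
        byuser[click[0]].append(click)  # the bag of click tuples per user
    result = []
    for user, bag in byuser.items():
        uniq_pages = {click[1] for click in bag}  # DISTINCT clicks.pageid
        uniq_links = {click[2] for click in bag}  # DISTINCT clicks.linkid
        result.append((user, len(uniq_pages), len(uniq_links)))
    return result


clicks = [
    ("Alice", "Page1", "Link1", "Site1"),
    ("John", "Page1", "Link2", "Site2"),
    ("John", "Page2", "Link2", "Site3"),
]
result = distinct_counts_per_user(clicks)
print(result)
```

John visited two distinct pages through a single distinct link, which gives the counts 2 and 1.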
39. PLAN EXECUTION - FLOW CONTROL (4)
• The outer operator graph contains a FOREACH operator
• FOREACH contains a nested operator graph of 2 pipelines
• Each pipeline contains a DISTINCT and a COUNT operator
• FOREACH requests a tuple T from the PACKAGE operator
• It places a cursor on the bag of click tuples for the 1st DISTINCT-COUNT pipeline
• It requests a tuple from the bottom of the pipeline (the COUNT operator)
• The process is repeated for the second pipeline
• The FOREACH operator constructs and returns the output tuple
40. PLAN EXECUTION - FLOW CONTROL
• When the nested plan is a single branching pipeline:
clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
(Alice,Page1,Link1,Site1)
(John,Page1,Link2,Site2)
(John,Page2,Link2,NULL)
byuser = GROUP clicks BY userid;
(Alice, {(Alice,Page1,Link1,Site1)})
(John, {(John,Page1,Link2,Site2), (John,Page2,Link2,NULL)})
result = FOREACH byuser
{
fltrd = FILTER clicks BY viewedat IS NOT NULL;
uniqPages = DISTINCT fltrd.pageid;
uniqLinks = DISTINCT fltrd.linkid;
GENERATE group, COUNT(uniqPages), COUNT(uniqLinks);
};
(Alice, 1, 1)
(John, 1, 1)
A more complex situation arises when the nested plan is not two independent pipelines but rather a single branching pipeline, as here, where both DISTINCT operators read from the same FILTER.
Solution:
• Pig currently handles this case by duplicating the FILTER operator and producing two independent pipelines, to be executed as explained above.
41. Plan Execution
• Flow Control
– Nested Programs
• Memory Management
Streaming
• Flow Control
42. PLAN EXECUTION - Memory Management
• Like Hadoop, Pig is implemented in Java
• Java memory management causes problems during query processing
– Java does not allow the developer to control memory allocation and deallocation directly
• Naive option: increase the JVM memory size limit beyond the physical memory size, and let the virtual memory manager take care of staging data between memory and disk
– Problem: performance degradation
• Better to return an "out-of-memory" error
– The administrator can adjust the memory management parameters and re-submit the program
43. PLAN EXECUTION - Memory Management
• Memory overflow is mostly due to large bags of tuples
• Java's MemoryPoolMXBean notifies Pig of a low-memory situation; when notified, Pig spills excess bags to disk
• Pig estimates bag sizes by sampling a few tuples
• The memory manager maintains a list of the Pig bags created in the same JVM, using a linked list of Java WeakReferences
• WeakReference ensures that bags no longer in use can still be garbage collected
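The bookkeeping described above can be approximated in Python with the weakref module. This is an analogy rather than Pig's Java code: the Bag and MemoryManager classes, the sampling rule in estimated_bytes and the spill stub (which only clears the in-memory list) are all assumptions, and the low-memory notification is simulated by calling on_low_memory directly.

```python
import gc
import sys
import weakref


class Bag:
    """A bag of tuples that can be spilled (here: simply emptied) on demand."""

    def __init__(self, tuples):
        self.tuples = list(tuples)
        self.spilled = False

    def estimated_bytes(self, sample_size=3):
        # Estimate the bag size from a few sampled tuples instead of all of them.
        sample = self.tuples[:sample_size]
        if not sample:
            return 0
        average = sum(sys.getsizeof(t) for t in sample) / len(sample)
        return int(average * len(self.tuples))

    def spill(self):
        self.tuples = []  # stand-in for writing the tuples to disk
        self.spilled = True


class MemoryManager:
    """Tracks bags through weak references so tracking never keeps a bag alive."""

    def __init__(self):
        self._refs = []

    def register(self, bag):
        self._refs.append(weakref.ref(bag))

    def live_bags(self):
        live = [bag for bag in (ref() for ref in self._refs) if bag is not None]
        self._refs = [weakref.ref(bag) for bag in live]  # drop dead references
        return live

    def on_low_memory(self, bytes_to_free):
        # Spill the largest bags first until enough memory has been released.
        freed = 0
        for bag in sorted(self.live_bags(), key=Bag.estimated_bytes, reverse=True):
            if freed >= bytes_to_free:
                break
            freed += bag.estimated_bytes()
            bag.spill()
        return freed


manager = MemoryManager()
big = Bag([(i, "x" * 10) for i in range(1000)])
small = Bag([(1, "a")])
temporary = Bag([(2, "b")])
manager.register(big)
manager.register(small)
manager.register(temporary)

del temporary  # no longer in use; the weak reference does not keep it alive
gc.collect()
live_count = len(manager.live_bags())
freed = manager.on_low_memory(bytes_to_free=1)
print(live_count, big.spilled, small.spilled)
```

Spilling the largest bag first is a design choice of this sketch; it frees the most memory with the fewest disk writes.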
44. Plan Execution
• Flow Control
– Nested Programs
• Memory Management
Streaming
• Flow Control
45. STREAMING – FLOW CONTROL
• Pig allows user-defined functions (UDFs)
– UDFs must be written in Java and must conform to Pig's UDF interface
– They behave synchronously
Streaming:
• Allows data to be pushed through external executables
– Users are able to intermix relational operations like grouping and filtering with custom or legacy executables
• A streaming executable behaves asynchronously
Challenges in implementing streaming in Pig:
• Fitting it into the iterator model of Pig's execution pipeline
• Because of the asynchronous behavior of the user's executable, the STREAM operator that wraps it cannot simply pull tuples synchronously as it does with other operators, because it does not know what state the executable is in
• There may be no output because:
– The executable is waiting to receive more input: the STREAM operator needs to push new data
– The executable is still busy processing prior inputs: the STREAM operator should wait
46. • With a single-threaded operator execution model, a deadlock can
occur
– The Pig operator is waiting for the external executable to
consume a new input tuple, while at the same time the
executable is waiting for its output to be consumed
Solution:
STREAM operator:
• Creates 2 additional threads: one to feed data to the executable and another to
consume its output
• Blocks until a tuple is available on the executable's output queue or until the executable
terminates
• If space is available in the input queue, places a tuple from the parent operator into it
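The two-thread design above can be sketched in Python with `subprocess`, `threading`, and bounded queues. This is a minimal sketch under assumptions, not Pig's code: the function name `stream_through`, the queue size, the polling timeout, and the line-per-tuple text protocol are all invented for the example.

```python
import queue
import subprocess
import sys
import threading

_DONE = object()  # sentinel marking the end of a stream


def stream_through(command, tuples, queue_size=4):
    """Yield the output lines of `command` while feeding it `tuples` as lines."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    in_q = queue.Queue(maxsize=queue_size)   # parent operator -> executable
    out_q = queue.Queue(maxsize=queue_size)  # executable -> this operator

    def feed():
        # Thread 1: move tuples from the input queue into the executable.
        try:
            while True:
                item = in_q.get()
                if item is _DONE:
                    break
                proc.stdin.write(item + "\n")
            proc.stdin.close()
        except OSError:
            pass  # the executable exited before reading all of its input

    def consume():
        # Thread 2: move the executable's output into the output queue.
        for line in proc.stdout:
            out_q.put(line.rstrip("\n"))
        out_q.put(_DONE)

    threading.Thread(target=feed, daemon=True).start()
    threading.Thread(target=consume, daemon=True).start()

    pending = iter(tuples)
    upstream_done = False
    while True:
        # If there is space in the input queue, take a tuple from the parent.
        # Only this thread puts into in_q, so the put below cannot block.
        if not upstream_done and not in_q.full():
            item = next(pending, _DONE)
            in_q.put(item)
            upstream_done = item is _DONE
            continue
        # Otherwise wait for output, rechecking the input queue periodically.
        try:
            out = out_q.get(timeout=0.01)
        except queue.Empty:
            continue
        if out is _DONE:  # the executable closed its output: it terminated
            break
        yield out
    proc.wait()


if __name__ == "__main__":
    # Hypothetical external executable: upper-cases each input line.
    upper = [sys.executable, "-u", "-c",
             "import sys\nfor line in sys.stdin:\n    print(line.strip().upper())"]
    print(list(stream_through(upper, ["alice", "john"])))
```

Because the main loop never blocks on a write to the executable and never ignores the output queue for long, the mutual wait described on this slide cannot arise: a full pipe stalls only the feeder thread, and a full output queue stalls only the consumer thread.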
47. Performance
• In the initial implementation of Pig, functionality and
proof of concept were considered more
important than performance.
• As Pig was adopted within Yahoo, better
performance quickly became a priority.
• Pig Mix: a publicly available benchmark used to
measure performance on a regular basis, so that
the effects of individual code changes on
performance could be understood.
48. Benchmark Results
Pig Mix benchmark
• September 11, 2008:
– Initial Apache open-source release
• November 11, 2008:
– Enhanced type system
– Rewrote execution pipeline
– Combiner enhanced
• January 20, 2009:
– Buffering during data parsing
– Fragment-replicate join algorithm
• February 23, 2009:
– Rework of partitioning function used in ORDER BY to ensure more balanced
distribution of keys to reducers
• April 20, 2009:
– Branching execution plans
• Vertical axis: ratio of the total running time of 12 Pig programs
to that of the corresponding Map-Reduce programs
• Current performance ratio is 1.5: a reasonable trade-off point between execution time and
code development/maintenance effort.
49. Pros & Cons
• The step-by-step method of creating a
program in Pig is much cleaner and simpler to
use than the single-block method of SQL. It is
easier to keep track of what your variables
are and where you are in the process of
analyzing your data.
• With the various interleaved clauses in SQL, it
is difficult to know what is actually happening
sequentially.
50. Pros & Cons
Pros:
• Explicit Dataflow
• Retains Properties of Map-Reduce
• Scalability
• Fault Tolerance
• Multi Way Processing
• Open Source
Cons:
• Column wise Storage structures are missing
• Memory Management
• No facilitation for Non Java Users
• Limited Optimization
• No GUI for Flow Graphs
51. Future Work
• Query optimization
– Currently rule-based optimizer for plan rearrangement and join selection
– Cost-based in the future
• Non-Java UDFs
• SQL interface
• Grouping and joining of pre-partitioned/sorted data.
– Avoid data shuffling for grouping and joining
– Building metadata facilities to keep track of data layout
• Skew handling.
– For load balancing
52. Summary
• Big demand for parallel data processing
– Programmers like dataflow pipes over static files
• Ease of programming
• UDFs: users can create their own functions to do special-purpose
processing.
• Optimization opportunities: the way in which tasks are
encoded permits the system to optimize their execution
automatically, allowing the user to focus on semantics rather
than efficiency.
• Open source
Pig Latin: sweet spot between Map-Reduce and SQL
53. Related Work
• Sawzall
– Data processing language on top of map-reduce
– Rigid structure of filtering followed by aggregation
• Hive
– SQL-like language on top of Map-Reduce
• DryadLINQ
– SQL-like language on top of Dryad
More natural to programmers than flat tuples; avoids expensive joins.