Pig Latin: A Not-So-Foreign
Language for Data Processing
Motivation
• You're a procedural programmer
• You have huge data
• You want to analyze it
Motivation
• As a procedural programmer…
  • You may find writing queries in SQL unnatural and too restrictive
  • You are more comfortable writing code as a series of statements than as one long query (e.g., the success of Map-Reduce)
Motivation
• Data analysis goals
  • Quick
    • Exploit the parallel processing power of a distributed system
  • Easy
    • Be able to write a program or query without a huge learning curve
    • Have some common analysis tasks predefined
  • Flexible
    • Transform one or more data sets into a workable structure without much overhead
    • Perform customized processing
  • Transparent
    • Have a say in how the data processing is executed on the system
Motivation
• Relational Distributed Databases
  • Parallel database products are expensive
  • Rigid schemas
  • Processing requires declarative SQL query construction
• Map-Reduce
  • Relies on custom code for even common operations
  • Needs workarounds for tasks whose data flow differs from the expected Map → Combine → Reduce
Motivation
• Relational Distributed Databases
• Sweet spot: take the best of both SQL and Map-Reduce; combine high-level declarative querying with low-level procedural programming… Pig Latin!
• Map-Reduce
Pig Latin Example
Table urls: (url, category, pagerank)
Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category
SQL:
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6
Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category,
    AVG(good_urls.pagerank);
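The four Pig Latin statements above can be traced step by step in a small Python sketch. The sample rows and the size threshold of 1 (standing in for 10^6) are made up for illustration; the sketch models what each statement computes, not how Pig executes it.

```python
from collections import defaultdict

# Hypothetical sample of the urls table: (url, category, pagerank).
urls = [
    ("a.com", "news", 0.75),
    ("b.com", "news", 0.5),
    ("c.com", "news", 0.125),
    ("d.com", "sports", 0.75),
    ("e.com", "sports", 0.125),
]
SIZE_THRESHOLD = 1  # assumption: stands in for 10^6 on the slide

# good_urls = FILTER urls BY pagerank > 0.2;
good_urls = [t for t in urls if t[2] > 0.2]

# groups = GROUP good_urls BY category;
groups = defaultdict(list)
for t in good_urls:
    groups[t[1]].append(t)

# big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
big_groups = {cat: bag for cat, bag in groups.items() if len(bag) > SIZE_THRESHOLD}

# output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
output = [(cat, sum(t[2] for t in bag) / len(bag)) for cat, bag in big_groups.items()]
```

Each intermediate name holds a complete bag, which is the point of the procedural style: any step can be inspected on its own.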
Outline
• System Overview
• Pig Latin (The Language)
  • Data Structures
  • Commands
• Pig (The Compiler)
  • Logical & Physical Plans
  • Optimization
  • Efficiency
• Pig Pen (The Debugger)
• Conclusion
Big Picture
[Figure: a Pig Latin script and user-defined functions go into Pig, which compiles and optimizes them into Map-Reduce statements that read data and write results]
Data Model
• Atom - simple atomic value (e.g., a number or a string)
• Tuple - sequence of fields; each field can be of any type
• Bag - collection of tuples
  • Duplicates possible
  • Tuples in a bag can have different numbers of fields and different field types
• Map - collection of key-value pairs
  • Key is an atom; value can be any type
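As a rough analogy, the four types can be mirrored with built-in Python values. This is only a sketch of the nesting rules listed above; the names and sample values are hypothetical, and Pig's own types are not Python objects.

```python
# Atom: a simple value such as a number or a string.
atom = "alice"

# Tuple: a sequence of fields, each of any type.
t1 = ("alice", 30)
t2 = ("bob", 25, "waterloo")  # a different number of fields is allowed

# Bag: a collection of tuples; duplicates are possible, so a list (not a set) is used.
bag = [t1, t2, t1]

# Map: keys are atoms, values can be any type, including a nested bag.
m = {"name": "alice", "friends": bag}

# Fully nested: a tuple may itself hold a bag and a map.
nested = ("alice", bag, m)
```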
Data Model
• Control over dataflow
Ex 1 (less efficient)
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
Ex 2 (more efficient if isSpam is costly: the cheap pagerank test runs first, so isSpam is applied to fewer tuples)
highpgr_urls = FILTER urls BY pagerank > 0.8;
spam_urls = FILTER highpgr_urls BY isSpam(url);
• Fully nested
  • More natural for procedural programmers (the target users) than normalization
  • Data is often stored on disk in a nested fashion
  • Makes user-defined functions easier to write
• No schema required
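The difference between the two filter orderings can be made visible by counting how often the UDF runs. The sketch below uses a hypothetical `is_spam` stand-in and made-up rows; it only illustrates why ordering matters when the UDF is the expensive step.

```python
call_log = []

def is_spam(url):
    """Hypothetical stand-in for an expensive UDF; records every invocation."""
    call_log.append(url)
    return "spam" in url

# Made-up (url, pagerank) tuples.
urls = [("spam1.com", 0.9), ("ok.com", 0.95), ("spam2.com", 0.1), ("fine.com", 0.2)]

# Ex 1: run the UDF on every tuple, then apply the cheap pagerank test.
spam_urls = [t for t in urls if is_spam(t[0])]
culprit_urls_1 = [t for t in spam_urls if t[1] > 0.8]
calls_ex1 = len(call_log)

# Ex 2: apply the cheap test first, so the UDF only sees the survivors.
call_log.clear()
highpgr_urls = [t for t in urls if t[1] > 0.8]
culprit_urls_2 = [t for t in highpgr_urls if is_spam(t[0])]
calls_ex2 = len(call_log)
```

Both orderings return the same tuples; only the number of UDF calls differs.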
Data Model
• User-Defined Functions (UDFs)
  • Can be used in many Pig Latin statements
  • Useful for custom processing tasks
  • Can take non-atomic values as input and produce them as output
  • Currently must be written in Java
  • Ex: spam_urls = FILTER urls BY isSpam(url);
Speaking Pig Latin
• LOAD
  • Input is assumed to be a bag (sequence of tuples)
  • Can specify a deserializer with 'USING'
  • Can provide a schema with 'AS'
newBag = LOAD 'filename'
    <USING functionName()>
    <AS (fieldName1, fieldName2, …)>;
Ex:
Queries = LOAD 'query_log.txt'
    USING myLoad()
    AS (userID, queryString, timeStamp);
Speaking Pig Latin
• FOREACH
  • Apply some processing to each tuple in a bag
  • Each field can be:
    • A field name of the bag
    • A constant
    • A simple expression (e.g., f1+f2)
    • A predefined function (e.g., SUM, AVG, COUNT, FLATTEN)
    • A UDF (e.g., sumTaxes(gst, pst))
newBag =
    FOREACH bagName
    GENERATE field1, field2, …;
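A FOREACH … GENERATE is essentially a per-tuple projection. The sketch below mirrors it with a Python list comprehension that mixes a field name, an expression with a UDF, and a constant. The `sales` bag, its fields, the `'CAD'` constant, and the Python `sum_taxes` function are assumptions made for illustration.

```python
# Hypothetical bag of (item, price, gst, pst) tuples.
sales = [("book", 20.0, 1.0, 1.5), ("pen", 2.0, 0.25, 0.25)]

def sum_taxes(gst, pst):
    """Stand-in for the slide's sumTaxes UDF."""
    return gst + pst

# receipts = FOREACH sales GENERATE item, price + sumTaxes(gst, pst), 'CAD';
receipts = [
    (item, price + sum_taxes(gst, pst), "CAD")
    for item, price, gst, pst in sales
]
```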
Speaking Pig Latin
• FILTER
  • Select a subset of the tuples in a bag
newBag = FILTER bagName
    BY expression;
  • Expression uses simple comparison operators (==, !=, <, >, …) and logical connectors (AND, NOT, OR)
some_apples =
    FILTER apples BY colour != 'red';
  • Can use UDFs
some_apples =
    FILTER apples BY NOT isRed(colour);
Speaking Pig Latin
• COGROUP
  • Group two datasets together by a common attribute
  • Groups data into nested bags
grouped_data = COGROUP results BY queryString,
    revenue BY queryString;
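The shape of a COGROUP result, one tuple per key holding one nested bag per input, can be sketched in Python as follows. The `cogroup` helper and the sample `results` and `revenue` rows are hypothetical; this models the output structure only.

```python
from collections import defaultdict

def cogroup(left, right, left_key, right_key):
    """Group two bags by a key; each output tuple is (key, left_bag, right_bag)."""
    grouped = defaultdict(lambda: ([], []))
    for t in left:
        grouped[left_key(t)][0].append(t)
    for t in right:
        grouped[right_key(t)][1].append(t)
    return [(key, bags[0], bags[1]) for key, bags in grouped.items()]

# Made-up sample data:
# results(queryString, url, position), revenue(queryString, adSlot, amount)
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30)]

# grouped_data = COGROUP results BY queryString, revenue BY queryString;
grouped_data = cogroup(results, revenue, lambda t: t[0], lambda t: t[0])
```

Note that no cross product is taken: the two bags stay separate inside each output tuple, ready for further processing.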
Speaking Pig Latin
• Why COGROUP and not JOIN?
  • May want to process nested bags of tuples before taking the cross product:
url_revenues =
    FOREACH grouped_data GENERATE
    FLATTEN(distributeRev(results, revenue));
  • Keeps to the goal of a single high-level data transformation per Pig Latin statement
  • However, the JOIN keyword is still available:
join_result = JOIN results BY queryString,
    revenue BY queryString;
is equivalent to
temp = COGROUP results BY queryString,
    revenue BY queryString;
join_result = FOREACH temp GENERATE
    FLATTEN(results), FLATTEN(revenue);
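The claimed equivalence between JOIN and COGROUP followed by FLATTEN can be checked on a small example. The sketch below uses made-up rows and models flattening two bags as their per-group cross product; a group in which one bag is empty contributes no output, which is what makes the result an inner join.

```python
from collections import defaultdict
from itertools import product

# Made-up sample data: results(queryString, url), revenue(queryString, amount)
results = [("lakers", "nba.com"), ("lakers", "espn.com"), ("kings", "nhl.com")]
revenue = [("lakers", 50), ("lakers", 20), ("kings", 30), ("bulls", 10)]

# temp = COGROUP results BY queryString, revenue BY queryString;
temp = defaultdict(lambda: ([], []))
for t in results:
    temp[t[0]][0].append(t)
for t in revenue:
    temp[t[0]][1].append(t)

# join_result = FOREACH temp GENERATE FLATTEN(results), FLATTEN(revenue);
# Flattening both bags yields their cross product within each group.
join_result = [
    left + right
    for left_bag, right_bag in temp.values()
    for left, right in product(left_bag, right_bag)
]

# A direct equi-join, for comparison.
direct_join = [l + r for l in results for r in revenue if l[0] == r[0]]
```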
Speaking Pig Latin
• STORE (& DUMP)
  • Output data to a file (or screen)
STORE bagName INTO 'filename'
    <USING serializer()>;
• Other Commands (incomplete)
  • UNION - return the union of two or more bags
  • CROSS - take the cross product of two or more bags
  • ORDER - order tuples by one or more specified fields
  • DISTINCT - eliminate duplicate tuples in a bag
  • LIMIT - limit results to a subset
Compilation
• Pig system does two tasks:
  • Builds a Logical Plan from a Pig Latin script
    • Supports execution platform independence
    • No processing of data is performed at this stage
  • Compiles the Logical Plan to a Physical Plan and executes it
    • Converts the Logical Plan into a series of Map-Reduce statements to be executed (in this case) by Hadoop Map-Reduce
Compilation
• Building a Logical Plan
  • Verify that the input files and bags referred to are valid
  • Create a logical plan for each bag (variable) defined
Compilation
• Building a Logical Plan Example
A = LOAD 'user.dat' AS (name, age, city);
B = GROUP A BY city;
C = FOREACH B GENERATE group AS city,
    COUNT(A);
D = FILTER C BY city == 'kitchener'
    OR city == 'waterloo';
STORE D INTO 'local_user_count.dat';
[Figure: the logical plan built one statement at a time: Load(user.dat) → Group → Foreach → Filter. In the final plan the Filter is moved ahead of the Group: Load(user.dat) → Filter → Group → Foreach]
Compilation
• Building a Physical Plan
  • Only happens when output is specified by STORE or DUMP
[Figure: the example script from the logical plan slides with its plan Load(user.dat) → Filter → Group → Foreach]
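The idea that defining bags only builds a plan, and that nothing runs until STORE or DUMP, can be illustrated with a toy Python class. `Plan`, `then`, `store`, and the sample rows are all hypothetical names invented for this sketch; it is not how Pig represents plans internally.

```python
class Plan:
    """Toy logical plan: each step is recorded, nothing runs until store()."""

    def __init__(self, steps=()):
        self.steps = list(steps)

    def then(self, name, fn):
        # Defining a new bag only extends the plan; no data is read here.
        return Plan(self.steps + [(name, fn)])

    def store(self, rows, trace):
        # STORE (or DUMP) is the point where the plan is actually executed.
        for name, fn in self.steps:
            trace.append(name)
            rows = fn(rows)
        return rows

def group_by_city(rows):
    groups = {}
    for r in rows:
        groups.setdefault(r[2], []).append(r)
    return groups

# Made-up contents of user.dat: (name, age, city).
users = [("ann", 30, "waterloo"), ("bob", 41, "toronto"), ("cy", 25, "waterloo")]

trace = []
plan = (
    Plan()
    .then("Filter", lambda rows: [r for r in rows if r[2] in ("kitchener", "waterloo")])
    .then("Group", group_by_city)
    .then("Foreach", lambda groups: [(city, len(bag)) for city, bag in groups.items()])
)

steps_before_store = list(trace)  # still empty: nothing has run yet
result = plan.store(users, trace)
```

Deferring execution this way is what gives the compiler room to reorder steps, such as moving the Filter ahead of the Group, before any data is touched.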
Compilation
• Building a Physical Plan
  • Step 1: Create a map-reduce job for each COGROUP
  • Step 2: Push other commands into the map and reduce functions where possible
    • Certain commands may require their own map-reduce job (e.g., ORDER needs separate map-reduce jobs)
[Figure: the plan Load(user.dat) → Filter → Group → Foreach divided into Map and Reduce phases]
Compilation
• Efficiency in Execution
  • Parallelism
    • Loading data - files are loaded from HDFS
    • Statements are compiled into map-reduce jobs
Compilation
• Efficiency with Nested Bags
  • In many cases, the nested bags created in each tuple of a COGROUP statement never need to be physically materialized
  • Aggregation is generally performed after a COGROUP, and the statements for that aggregation are pushed into the reduce function
  • Applies to algebraic functions (e.g., COUNT, MAX, MIN, SUM, AVG)
Compilation
• Efficiency with Nested Bags
[Figure: the plan Load(user.dat) → Filter → Group → Foreach shown across the Map, Combine, and Reduce stages]
Compilation
• Efficiency with Nested Bags
  • Why this works:
    • COUNT is an algebraic function; it can be structured as a tree of sub-functions, with each leaf working on a subset of the data
[Figure: each Combine step computes a partial COUNT; the Reduce step SUMs the partial counts]
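The tree structure of COUNT can be sketched in a few lines of Python: partial counts are computed per partition and then summed. The function names and the three-way split are assumptions for illustration; Pig's own interface for algebraic functions is not shown here.

```python
def count_partial(chunk):
    """Combine step: count the tuples in one partition of the bag."""
    return len(chunk)

def count_final(partial_counts):
    """Reduce step: SUM the partial counts."""
    return sum(partial_counts)

bag = list(range(10))                       # made-up bag of 10 tuples
partitions = [bag[0:4], bag[4:7], bag[7:]]  # as if split across three map tasks

partials = [count_partial(p) for p in partitions]
total = count_final(partials)
```

Because only the small partial counts travel to the reducer, the full bag never has to be assembled in one place.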
Compilation
• Efficiency with Nested Bags
  • Pig provides an interface for writing algebraic UDFs so they can take advantage of this optimization as well
• Inefficiencies
  • Non-algebraic aggregate functions (e.g., MEDIAN) need the entire bag to materialize; a very large bag may spill to disk if it doesn't fit in memory
  • Every map-reduce job requires data to be written and replicated to HDFS (although this is offset by the parallelism achieved)
Debugging
• How to verify the semantics of an analysis program
  • Run the program against the whole data set. Might take hours!
  • Generate a sample dataset
    • An empty result set may occur with selective operations such as join and filter
    • In general, testing with a sample dataset is difficult
• Pig-Pen
  • Samples data from the large dataset for the Pig statements
  • Applies individual Pig Latin commands against the sample
  • If a result is empty, the Pig system resamples
  • Removes redundant samples
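The resampling idea can be sketched as a simple retry loop: draw a small sample, run the command on it, and draw again if the output is empty. This is a deliberately naive illustration with hypothetical names and data, not Pig-Pen's actual sample-generation algorithm.

```python
import random

def sample_until_nonempty(data, command, size, rng, max_tries=100):
    """Draw small samples until `command` produces output on one of them."""
    for _ in range(max_tries):
        sample = rng.sample(data, size)
        output = command(sample)
        if output:
            return sample, output
    return None, []

# Made-up data set of (url, score) tuples: only 1 in 10 passes the filter below.
data = [("url%d" % i, i) for i in range(100)]

def high_score(rows):
    """Selective command: a small random sample often yields an empty result."""
    return [r for r in rows if r[1] >= 90]

sample, output = sample_until_nonempty(data, high_score, size=5, rng=random.Random(0))
```

A selective filter like this one is exactly the case where a naive sample tends to produce nothing useful to look at.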
Debugging
• Pig-Pen
[Figure: Pig-Pen screenshot showing the Pig Latin command window and command generator, and the sandbox dataset (generated automatically!)]
Debugging
• Pig-Pen
  • Provides sample data that is:
    • Real - taken from actual data
    • Concise - as small as possible
    • Complete - collectively illustrates the key semantics of each command
  • Helps with schema definition
  • Facilitates incremental program writing
Conclusion
• Pig is a data processing environment in Hadoop that is specifically targeted towards procedural programmers who perform large-scale data analysis.
• Pig Latin offers high-level data manipulation in a procedural style.
• Pig-Pen is a debugging environment for Pig Latin commands that generates samples from real data.
More Info
• Pig, http://hadoop.apache.org/pig/
• Hadoop, http://hadoop.apache.org
Anks-Thay!

2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 

Lec_4_1_IntrotoPIG.pptx

  • 1. Pig Latin: A Not-So-Foreign Language for Data Processing
  • 2. Motivation
      - You're a procedural programmer
      - You have huge data
      - You want to analyze it
  • 3. Motivation
      - As a procedural programmer…
          - You may find writing queries in SQL unnatural and too restrictive
          - You are more comfortable writing code as a series of statements than as one long query (e.g., MapReduce has been so successful)
  • 4. Motivation
      - Data analysis goals
          - Quick: exploit the parallel processing power of a distributed system
          - Easy
              - Be able to write a program or query without a huge learning curve
              - Have some common analysis tasks predefined
          - Flexible
              - Transform data sets into a workable structure without much overhead
              - Perform customized processing
          - Transparent: have a say in how the data processing is executed on the system
  • 5. Motivation
      - Relational Distributed Databases
          - Parallel database products are expensive
          - Rigid schemas
          - Processing requires declarative SQL query construction
      - Map-Reduce
          - Relies on custom code for even common operations
          - Needs workarounds for tasks whose data flow differs from the expected Map → Combine → Reduce
  • 6. Motivation
      - Relational Distributed Databases
      - Sweet spot: take the best of both SQL and Map-Reduce; combine high-level declarative querying with low-level procedural programming… Pig Latin!
      - Map-Reduce
  • 7. Pig Latin Example
      - Table urls: (url, category, pagerank)
      - Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category
      - SQL:
          SELECT category, AVG(pagerank)
          FROM urls WHERE pagerank > 0.2
          GROUP BY category HAVING COUNT(*) > 10^6
      - Pig Latin:
          good_urls = FILTER urls BY pagerank > 0.2;
          groups = GROUP good_urls BY category;
          big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
          output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
  • 8. Outline
      - System Overview
      - Pig Latin (The Language)
          - Data Structures
          - Commands
      - Pig (The Compiler)
          - Logical & Physical Plans
          - Optimization
          - Efficiency
      - Pig Pen (The Debugger)
      - Conclusion
  • 10. Data Model
      - Atom: simple atomic value (e.g., a number or string)
      - Tuple
      - Bag
      - Map
  • 11. Data Model
      - Atom
      - Tuple: sequence of fields; each field can be of any type
      - Bag
      - Map
  • 12. Data Model
      - Atom
      - Tuple
      - Bag: collection of tuples
          - Duplicates possible
          - Tuples in a bag can have different numbers of fields and different field types
      - Map
  • 13. Data Model
      - Atom
      - Tuple
      - Bag
      - Map: collection of key-value pairs
          - Key is an atom; value can be of any type
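The four types above can be pictured with ordinary Python values. This is only a rough analogy, not Pig's actual representation; the sample names and values are made up for illustration.

```python
# Hypothetical Python stand-ins for the four Pig Latin data types.
atom = "alice"                       # Atom: a single string or number
tup = ("alice", 27, "waterloo")      # Tuple: ordered fields, each of any type

# Bag: a collection of tuples; duplicates are allowed and tuples
# may differ in the number and types of their fields.
bag = [
    ("alice", 27, "waterloo"),
    ("alice", 27, "waterloo"),       # duplicate tuple
    ("bob", "kitchener"),            # fewer fields than the others
]

# Map: keys are atoms; values can be any type, including a nested bag.
record = {"name": atom, "visits": bag}
```

Because a map value can itself be a bag of tuples, the model nests to any depth, which is what the later slides mean by "fully nested".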
  • 14. Data Model
      - Control over dataflow
          - Ex 1 (less efficient):
              spam_urls = FILTER urls BY isSpam(url);
              culprit_urls = FILTER spam_urls BY pagerank > 0.8;
          - Ex 2 (more efficient):
              highpgr_urls = FILTER urls BY pagerank > 0.8;
              spam_urls = FILTER highpgr_urls BY isSpam(url);
      - Fully nested
          - More natural for procedural programmers (the target user) than normalization
          - Data is often stored on disk in a nested fashion
          - Makes user-defined functions easier to write
      - No schema required
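One way to see why the second ordering is cheaper is to count how often the expensive function runs. The sketch below is a Python model, not Pig; `is_spam` is a hypothetical stand-in for the slide's `isSpam` UDF, and the sample rows are invented.

```python
def make_counting_udf():
    """Return a stand-in spam check plus a counter of how often it ran."""
    calls = {"n": 0}

    def is_spam(url):
        calls["n"] += 1              # pretend this call is expensive
        return "spam" in url

    return is_spam, calls

# Hypothetical (url, pagerank) rows.
urls = [
    ("spam-a.com", 0.9),
    ("spam-b.com", 0.1),
    ("news.com", 0.95),
    ("blog.com", 0.3),
    ("spam-c.com", 0.2),
]

def udf_first(rows):
    """Ex 1: run the UDF on every row, then apply the cheap filter."""
    is_spam, calls = make_counting_udf()
    spam_urls = [r for r in rows if is_spam(r[0])]
    culprit_urls = [r for r in spam_urls if r[1] > 0.8]
    return culprit_urls, calls["n"]

def cheap_first(rows):
    """Ex 2: apply the cheap filter first, then run the UDF on survivors."""
    is_spam, calls = make_counting_udf()
    highpgr_urls = [r for r in rows if r[1] > 0.8]
    spam_urls = [r for r in highpgr_urls if is_spam(r[0])]
    return spam_urls, calls["n"]
```

Both orderings return the same rows; on this sample the first calls the UDF five times and the second only twice.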
  • 15. Data Model
      - User-Defined Functions (UDFs)
          - Can be used in many Pig Latin statements
          - Useful for custom processing tasks
          - Can use non-atomic values for input and output
          - Currently must be written in Java
      - Ex:
          spam_urls = FILTER urls BY isSpam(url);
  • 16. Speaking Pig Latin
      - LOAD
          - Input is assumed to be a bag (sequence of tuples)
          - Can specify a deserializer with USING
          - Can provide a schema with AS
      - Syntax:
          newBag = LOAD 'filename' <USING functionName()> <AS (fieldName1, fieldName2, …)>;
      - Ex:
          Queries = LOAD 'query_log.txt' USING myLoad() AS (userID, queryString, timeStamp);
  • 17. Speaking Pig Latin
      - FOREACH
          - Apply some processing to each tuple in a bag
          - Each field can be:
              - A field name of the bag
              - A constant
              - A simple expression (e.g., f1+f2)
              - A predefined function (e.g., SUM, AVG, COUNT, FLATTEN)
              - A UDF (e.g., sumTaxes(gst, pst))
      - Syntax:
          newBag = FOREACH bagName GENERATE field1, field2, …;
  • 18. Speaking Pig Latin
      - FILTER
          - Select a subset of the tuples in a bag
              newBag = FILTER bagName BY expression;
          - Expression uses simple comparison operators (==, !=, <, >, …) and logical connectors (AND, NOT, OR)
              some_apples = FILTER apples BY colour != 'red';
          - Can use UDFs
              some_apples = FILTER apples BY NOT isRed(colour);
  • 19. Speaking Pig Latin
      - COGROUP
          - Group two datasets together by a common attribute
          - Groups data into nested bags
              grouped_data = COGROUP results BY queryString, revenue BY queryString;
  • 20. Speaking Pig Latin
      - Why COGROUP and not JOIN?
          url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRev(results, revenue));
  • 21. Speaking Pig Latin
      - Why COGROUP and not JOIN?
          - May want to process nested bags of tuples before taking the cross product
          - Keeps to the goal of a single high-level data transformation per Pig Latin statement
          - However, the JOIN keyword is still available:
              join_result = JOIN results BY queryString, revenue BY queryString;
            is equivalent to:
              temp = COGROUP results BY queryString, revenue BY queryString;
              join_result = FOREACH temp GENERATE FLATTEN(results), FLATTEN(revenue);
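The relationship between COGROUP and JOIN can be sketched in Python. This models the semantics only and is not how Pig implements either operator; the `cogroup` and `join` helpers and the sample rows are hypothetical, with the key assumed to be the first field.

```python
from collections import defaultdict
from itertools import product

def cogroup(left, right, key=0):
    """Group two bags by a shared key into (key, left_bag, right_bag) tuples."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[t[key]][0].append(t)
    for t in right:
        groups[t[key]][1].append(t)
    return [(k, ls, rs) for k, (ls, rs) in sorted(groups.items())]

def join(left, right, key=0):
    """Equi-join expressed as COGROUP followed by flattening both nested bags."""
    return [
        l + r
        for _, ls, rs in cogroup(left, right, key)
        for l, r in product(ls, rs)   # cross product within each group
    ]

# Hypothetical sample data: (queryString, url, position) and (queryString, adSlot, amount).
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("iphone", "top", 10)]

grouped = cogroup(results, revenue)
joined = join(results, revenue)
```

Here `grouped` keeps the keys `kings` and `iphone` with one empty nested bag each, so a function such as the slides' `distributeRev` could still inspect them. `joined` contains only the four `lakers` combinations, because the cross product with an empty bag yields nothing.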
  • 22. Speaking Pig Latin
      - STORE (& DUMP)
          - Output data to a file (or the screen)
              STORE bagName INTO 'filename' <USING serializer()>;
      - Other commands (incomplete list)
          - UNION: return the union of two or more bags
          - CROSS: take the cross product of two or more bags
          - ORDER: order tuples by one or more specified fields
          - DISTINCT: eliminate duplicate tuples in a bag
          - LIMIT: limit results to a subset
  • 23. Compilation
      - The Pig system does two tasks:
          - Builds a Logical Plan from a Pig Latin script
              - Supports execution platform independence
              - No processing of data is performed at this stage
          - Compiles the Logical Plan to a Physical Plan and executes it
              - Converts the Logical Plan into a series of Map-Reduce jobs to be executed (in this case) by Hadoop Map-Reduce
  • 24. Compilation
      - Building a Logical Plan
          - Verify that the input files and bags referred to are valid
          - Create a logical plan for each bag (variable) defined
  • 25. Compilation
      - Building a Logical Plan: example
          A = LOAD 'user.dat' AS (name, age, city);
          B = GROUP A BY city;
          C = FOREACH B GENERATE group AS city, COUNT(A);
          D = FILTER C BY city == 'kitchener' OR city == 'waterloo';
          STORE D INTO 'local_user_count.dat';
      - Plan so far: Load(user.dat)
  • 26. Compilation
      - Building a Logical Plan: example (same script)
      - Plan so far: Load(user.dat) → Group
  • 27. Compilation
      - Building a Logical Plan: example (same script)
      - Plan so far: Load(user.dat) → Group → Foreach
  • 28. Compilation
      - Building a Logical Plan: example (same script)
      - Plan so far: Load(user.dat) → Group → Foreach → Filter
  • 29. Compilation
      - Building a Logical Plan: example (same script)
      - Plan with the filter placed ahead of the group: Load(user.dat) → Filter → Group → Foreach
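The filter in this example tests only the grouping key (city), which is why it can sit either before or after the group without changing the answer. A small Python model of the two plan orders, using invented sample rows in place of user.dat:

```python
from collections import Counter

# Hypothetical (name, age, city) rows standing in for user.dat.
users = [
    ("ann", 31, "waterloo"),
    ("raj", 25, "kitchener"),
    ("li", 40, "toronto"),
    ("sam", 22, "waterloo"),
]
LOCAL = {"kitchener", "waterloo"}

def group_then_filter(rows):
    """Load → Group → Foreach(COUNT) → Filter."""
    counts = Counter(city for _, _, city in rows)
    return {city: n for city, n in counts.items() if city in LOCAL}

def filter_then_group(rows):
    """Load → Filter → Group → Foreach(COUNT)."""
    kept = [r for r in rows if r[2] in LOCAL]
    return dict(Counter(city for _, _, city in kept))
```

Both functions return the same counts, but the second never groups the rows for other cities, so less data would reach the grouping step.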
  • 30. Compilation
      - Building a Physical Plan (same script as in slide 25)
          - Only happens when output is specified by STORE or DUMP
      - Plan: Load(user.dat) → Filter → Group → Foreach
  • 31. Compilation
      - Building a Physical Plan
          - Step 1: Create a map-reduce job for each COGROUP
      - [Figure: plan Load(user.dat) → Filter → Group → Foreach, divided into Map and Reduce stages]
  • 32. Compilation
      - Building a Physical Plan
          - Step 1: Create a map-reduce job for each COGROUP
          - Step 2: Push other commands into the map and reduce functions where possible
              - Certain commands may require their own map-reduce job (e.g., ORDER needs separate map-reduce jobs)
      - [Figure: plan Load(user.dat) → Filter → Group → Foreach, divided into Map and Reduce stages]
  • 33. Compilation
      - Efficiency in Execution
          - Parallelism
              - Loading data: files are loaded from HDFS
              - Statements are compiled into map-reduce jobs
  • 34. Compilation
      - Efficiency with Nested Bags
          - In many cases, the nested bags created in each tuple of a COGROUP statement never need to be physically materialized
          - Aggregation is generally performed after a COGROUP, and the statements for that aggregation are pushed into the reduce function
          - Applies to algebraic functions (e.g., COUNT, MAX, MIN, SUM, AVG)
  • 35. Compilation
      - Efficiency with Nested Bags
      - [Figure: plan Load(user.dat) → Filter → Group → Foreach, Map stage labeled]
  • 36. Compilation
      - Efficiency with Nested Bags
      - [Figure: plan Load(user.dat) → Filter → Group → Foreach, Combine stage labeled]
  • 37. Compilation
      - Efficiency with Nested Bags
      - [Figure: plan Load(user.dat) → Filter → Group → Foreach, Reduce stage labeled]
  • 38. Compilation
      - Efficiency with Nested Bags
          - Why this works: COUNT is an algebraic function; it can be structured as a tree of sub-functions, with each leaf working on a subset of the data
      - [Figure: tree with COUNT applied to subsets in the Combine stage and SUM applied to the partial counts in the Reduce stage]
  • 39. Compilation
      - Efficiency with Nested Bags
          - Pig provides an interface for writing algebraic UDFs so they can take advantage of this optimization as well
      - Inefficiencies
          - Non-algebraic aggregate functions (e.g., MEDIAN) need the entire bag to be materialized; a very large bag may spill to disk if it doesn't fit in memory
          - Every map-reduce job requires data to be written and replicated to HDFS (although this is offset by the parallelism achieved)
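The difference between algebraic and non-algebraic aggregates can be shown with a short Python model. The function names `count_initial` and `count_final` are illustrative only and are not Pig's UDF interface; the chunking stands in for data split across map tasks.

```python
import statistics

def chunks(seq, size):
    """Split a list into consecutive pieces, as if spread across map tasks."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def count_initial(chunk):
    """Combine stage: COUNT over one subset of the bag."""
    return len(chunk)

def count_final(partials):
    """Reduce stage: SUM of the partial counts."""
    return sum(partials)

bag = list(range(1, 12))                                   # 11 items
partial_counts = [count_initial(c) for c in chunks(bag, 4)]
total = count_final(partial_counts)                        # equals len(bag)

# MEDIAN does not decompose this way: the median of per-chunk medians
# can differ from the median of the whole bag.
values = [1, 2, 9, 3, 4, 8, 5, 6, 7]
median_of_medians = statistics.median(
    statistics.median(c) for c in chunks(values, 3)
)
true_median = statistics.median(values)
```

For COUNT, the partial results are small numbers, so the full nested bag never has to exist in one place. For the median example, the two answers disagree (4 versus 5), which is why such functions need the whole bag.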
  • 40. Debugging
      - How to verify the semantics of an analysis program
          - Run the program against the whole data set. Might take hours!
          - Generate a sample dataset
              - Empty result sets may occur for some operations, such as join and filter
              - Generally, testing with a sample dataset is difficult
      - Pig-Pen
          - Samples data from a large dataset for Pig statements
          - Applies individual Pig Latin commands against the dataset
          - In case of an empty result, the Pig system resamples
          - Removes redundant samples
  • 42. Debugging
      - [Figure: Pig Latin command window and command generator]
  • 43. Debugging
      - [Figure: Sand Box Dataset (generated automatically!)]
  • 44. Debugging
      - Pig-Pen
          - Provides sample data that is:
              - Real: taken from actual data
              - Concise: as small as possible
              - Complete: collectively illustrates the key semantics of each command
          - Helps with schema definition
          - Facilitates incremental program writing
  • 45. Conclusion
      - Pig is a data processing environment in Hadoop that is specifically targeted towards procedural programmers who perform large-scale data analysis.
      - Pig Latin offers high-level data manipulation in a procedural style.
      - Pig-Pen is a debugging environment for Pig Latin commands that generates samples from real data.
  • 46. More Info
      - Pig, http://hadoop.apache.org/pig/
      - Hadoop, http://hadoop.apache.org
      - Anks-Thay!