Pig Latin: A Not-So-Foreign
Language for Data Processing
Motivation
- You're a procedural programmer
- You have huge data
- You want to analyze it
Motivation
- As a procedural programmer…
  - You may find writing queries in SQL unnatural and too restrictive
  - You are more comfortable writing code as a series of statements than as one long query (one reason MapReduce is so successful)
Motivation
- Data analysis goals
  - Quick
    - Exploit the parallel processing power of a distributed system
  - Easy
    - Be able to write a program or query without a huge learning curve
    - Have some common analysis tasks predefined
  - Flexible
    - Transform one or more data sets into a workable structure without much overhead
    - Perform customized processing
  - Transparent
    - Have a say in how the data processing is executed on the system
Motivation
- Relational Distributed Databases
  - Parallel database products are expensive
  - Rigid schemas
  - Processing requires constructing declarative SQL queries
- Map-Reduce
  - Relies on custom code for even common operations
  - Needs workarounds for tasks whose data flow differs from the expected Map → Combine → Reduce
Motivation
- Relational Distributed Databases at one end, Map-Reduce at the other
- Sweet Spot: take the best of both SQL and Map-Reduce; combine high-level declarative querying with low-level procedural programming… Pig Latin!
Pig Latin Example
Table urls: (url, category, pagerank)
Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category
SQL:
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 1000000
Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
output = FOREACH big_groups GENERATE category,
    AVG(good_urls.pagerank);
Outline
- System Overview
- Pig Latin (The Language)
  - Data Structures
  - Commands
- Pig (The Compiler)
  - Logical & Physical Plans
  - Optimization
  - Efficiency
- Pig Pen (The Debugger)
- Conclusion
Big Picture
[Figure: a Pig Latin script and user-defined functions go into Pig, which compiles and optimizes them into Map-Reduce statements; the Map-Reduce layer reads the data and writes the results.]
Data Model
- Atom - simple atomic value (e.g., a number or string)
- Tuple - sequence of fields; each field can be of any type
- Bag - collection of tuples
  - Duplicates possible
  - Tuples in a bag can have different numbers of fields and different field types
- Map - collection of key-value pairs
  - Key is an atom; value can be any type
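The four types can be mimicked with ordinary Python values. This is only an illustrative sketch: the variable names and sample values are invented here, and Python lists, tuples, and dicts are stand-ins, not Pig's own representation.

```python
# Python stand-ins for Pig Latin's data model (illustration only, not Pig's API).
atom = "alice"                         # Atom: a simple value such as a string or number

tuple_ = ("alice", 30, "waterloo")     # Tuple: a sequence of fields of any type

bag = [                                # Bag: a collection of tuples
    ("alice", 30, "waterloo"),
    ("alice", 30, "waterloo"),         # duplicates are allowed
    ("bob", ("nested", "tuple")),      # different field count and field types
]

map_ = {                               # Map: atom keys, values of any type
    "name": "alice",
    "visits": [("lakers.com",), ("espn.com",)],   # here the value is itself a bag
}
```

Because a field can hold a tuple or a bag, the model is fully nested: a bag may sit inside a tuple that sits inside another bag.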
Data Model
- Control over dataflow
  Ex 1 (less efficient)
  spam_urls = FILTER urls BY isSpam(url);
  culprit_urls = FILTER spam_urls BY pagerank > 0.8;
  Ex 2 (more efficient: the cheap pagerank test runs first, so the isSpam UDF sees fewer tuples)
  highpgr_urls = FILTER urls BY pagerank > 0.8;
  spam_urls = FILTER highpgr_urls BY isSpam(url);
- Fully nested
  - More natural for procedural programmers (the target users) than normalization
  - Data is often stored on disk in a nested fashion
  - Makes user-defined functions easier to write
- No schema required
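The effect of statement order in the two dataflow examples can be checked with a small Python sketch. It assumes that isSpam is the expensive step; the `is_spam` function and the sample rows below are made up for illustration. Both orderings return the same tuples, but the second one calls the UDF less often.

```python
calls = {"n": 0}   # counts invocations of the (assumed expensive) UDF


def is_spam(url):
    """Hypothetical stand-in for the isSpam UDF."""
    calls["n"] += 1
    return "spam" in url


# Made-up (url, pagerank) tuples.
urls = [("a.com/spam", 0.9), ("b.com", 0.95), ("c.com/spam", 0.1), ("d.com", 0.2)]


def ex1(rows):
    """UDF filter first, pagerank filter second."""
    spam_urls = [r for r in rows if is_spam(r[0])]
    return [r for r in spam_urls if r[1] > 0.8]


def ex2(rows):
    """Pagerank filter first, UDF filter second."""
    highpgr_urls = [r for r in rows if r[1] > 0.8]
    return [r for r in highpgr_urls if is_spam(r[0])]


calls["n"] = 0
result1 = ex1(urls)
calls_ex1 = calls["n"]   # the UDF ran on every tuple

calls["n"] = 0
result2 = ex2(urls)
calls_ex2 = calls["n"]   # the UDF ran only on high-pagerank tuples
```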
Data Model
- User-Defined Functions (UDFs)
  - Can be used in many Pig Latin statements
  - Useful for custom processing tasks
  - Can take non-atomic values as input and produce them as output
  - Currently must be written in Java
  - Ex: spam_urls = FILTER urls BY isSpam(url);
Speaking Pig Latin
- LOAD
  - Input is assumed to be a bag (sequence of tuples)
  - Can specify a deserializer with 'USING'
  - Can provide a schema with 'AS'
newBag = LOAD 'filename'
    <USING functionName()>
    <AS (fieldName1, fieldName2, …)>;
Ex:
Queries = LOAD 'query_log.txt'
    USING myLoad()
    AS (userID, queryString, timeStamp);
Speaking Pig Latin
- FOREACH
  - Apply some processing to each tuple in a bag
  - Each field can be:
    - A field name of the bag
    - A constant
    - A simple expression (e.g., f1+f2)
    - A predefined function (e.g., SUM, AVG, COUNT, FLATTEN)
    - A UDF (e.g., sumTaxes(gst, pst))
newBag =
    FOREACH bagName
    GENERATE field1, field2, …;
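As a rough Python analogy, FOREACH … GENERATE behaves like a comprehension that builds one output tuple per input tuple. The `sales` bag, its field layout, and the body of `sum_taxes` are assumptions made for this sketch; only the name sumTaxes(gst, pst) comes from the slide. Amounts are in cents to keep the arithmetic exact.

```python
def sum_taxes(gst, pst):
    """Hypothetical UDF modelled on the slide's sumTaxes(gst, pst)."""
    return gst + pst


# Assumed bag of (item, price, gst, pst) tuples, amounts in cents.
sales = [("book", 2000, 100, 160), ("pen", 200, 10, 16)]

# Roughly: new_bag = FOREACH sales GENERATE item, 'CAD', price + gst + pst, sumTaxes(gst, pst);
new_bag = [
    (item,                      # a field of the bag
     "CAD",                     # a constant
     price + gst + pst,         # a simple expression
     sum_taxes(gst, pst))       # a UDF call
    for (item, price, gst, pst) in sales
]
```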
Speaking Pig Latin
- FILTER
  - Select a subset of the tuples in a bag
newBag = FILTER bagName
    BY expression;
  - Expression uses simple comparison operators (==, !=, <, >, …) and logical connectors (AND, NOT, OR)
some_apples =
    FILTER apples BY colour != 'red';
  - Can use UDFs
some_apples =
    FILTER apples BY NOT isRed(colour);
Speaking Pig Latin
- COGROUP
  - Group two datasets together by a common attribute
  - Groups data into nested bags
grouped_data = COGROUP results BY queryString,
    revenue BY queryString;
Speaking Pig Latin
- Why COGROUP and not JOIN?
url_revenues =
    FOREACH grouped_data GENERATE
    FLATTEN(distributeRev(results, revenue));
  - May want to process nested bags of tuples before taking the cross product.
  - Keeps to the goal of a single high-level data transformation per Pig Latin statement.
  - However, the JOIN keyword is still available:
join_result = JOIN results BY queryString,
    revenue BY queryString;
    is equivalent to
temp = COGROUP results BY queryString,
    revenue BY queryString;
join_result = FOREACH temp GENERATE
    FLATTEN(results), FLATTEN(revenue);
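The COGROUP/JOIN equivalence can be sketched in Python. The `cogroup` helper and the sample rows are invented for illustration and only mimic the semantics described on the slides: COGROUP yields one tuple per key holding a nested bag from each input, and flattening both bags produces the per-key cross product, which is the join.

```python
from collections import defaultdict
from itertools import product

# Made-up sample data: (queryString, url, rank) and (queryString, adSlot, amount).
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30)]


def cogroup(left, right, key=lambda t: t[0]):
    """Return (key, bag_of_left_tuples, bag_of_right_tuples) for every key."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[key(t)][0].append(t)
    for t in right:
        groups[key(t)][1].append(t)
    return [(k, l_bag, r_bag) for k, (l_bag, r_bag) in groups.items()]


grouped_data = cogroup(results, revenue)

# Flattening both nested bags gives the cross product within each group.
join_result = [
    l + r
    for _, l_bag, r_bag in grouped_data
    for l, r in product(l_bag, r_bag)
]
```

Stopping at `grouped_data` is what lets a function such as distributeRev see each group's two bags whole, before any cross product is taken.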
Speaking Pig Latin
- STORE (& DUMP)
  - Output data to a file (or screen)
STORE bagName INTO 'filename'
    <USING serializer()>;
- Other Commands (incomplete)
  - UNION - return the union of two or more bags
  - CROSS - take the cross product of two or more bags
  - ORDER - order tuples by one or more specified fields
  - DISTINCT - eliminate duplicate tuples in a bag
  - LIMIT - limit results to a subset
Compilation
- Pig system does two tasks:
  - Builds a Logical Plan from a Pig Latin script
    - Supports execution platform independence
    - No processing of data performed at this stage
  - Compiles the Logical Plan to a Physical Plan and executes it
    - Converts the Logical Plan into a series of Map-Reduce statements to be executed (in this case) by Hadoop Map-Reduce
Compilation
- Building a Logical Plan
  - Verify input files and bags referred to are valid
  - Create a logical plan for each bag (variable) defined
Compilation
- Building a Logical Plan Example
A = LOAD 'user.dat' AS (name, age, city);
B = GROUP A BY city;
C = FOREACH B GENERATE group AS city,
    COUNT(A);
D = FILTER C BY city == 'kitchener'
    OR city == 'waterloo';
STORE D INTO 'local_user_count.dat';
[Figure: the plan grows one operator per statement: Load(user.dat) → Group → Foreach → Filter. The last build of the slide shows it reordered as Load(user.dat) → Filter → Group → Foreach.]
Compilation
- Building a Physical Plan
  - Only happens when output is specified by STORE or DUMP
[Figure: the same script beside its logical plan, Load(user.dat) → Filter → Group → Foreach]
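The split between building a plan and executing it can be imitated in a few lines of Python. This is a toy sketch, not Pig's implementation: the `Bag` class, the `load` helper, and the sample rows are all hypothetical. Defining bags only records operators; data is touched when `dump()` is called, in the spirit of STORE or DUMP triggering execution.

```python
class Bag:
    """A logical-plan node: remembers its operator and input, holds no data."""

    def __init__(self, op, source=None, fn=None):
        self.op, self.source, self.fn = op, source, fn

    def plan(self):
        """Operator names from the data source down to this node."""
        upstream = self.source.plan() if self.source else []
        return upstream + [self.op]

    def filter(self, pred):
        return Bag("Filter", self, lambda rows: [r for r in rows if pred(r)])

    def foreach(self, f):
        return Bag("Foreach", self, lambda rows: [f(r) for r in rows])

    def dump(self):
        """Only here is any data actually read and processed."""
        rows = self.source.dump() if self.source else None
        return self.fn(rows)


executed = []   # records when the loader really runs


def load(rows):
    def read(_):
        executed.append("Load")
        return list(rows)
    return Bag("Load", None, read)


A = load([("ann", 31, "waterloo"), ("bob", 25, "toronto")])
D = A.filter(lambda r: r[2] == "waterloo").foreach(lambda r: r[0])

plan_before = D.plan()        # the plan exists before any execution
ran_before = list(executed)   # nothing has been loaded yet
out = D.dump()                # execution happens here
```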
Compilation
- Building a Physical Plan
  - Step 1: Create a map-reduce job for each COGROUP
  - Step 2: Push other commands into the map and reduce functions where possible
    - Certain commands may require their own map-reduce job (e.g., ORDER needs separate map-reduce jobs)
[Figure: the plan Load(user.dat) → Filter → Group → Foreach, annotated with the Map and Reduce phases]
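The two steps can be approximated by a small Python function over a linear plan. It is a simplified sketch under stated assumptions: plans are plain lists of operator names, every Group or Cogroup opens the reduce side of a job, operators before the first grouping go to that job's map function, and operators after a grouping go to its reduce function. The function name and the job dictionaries are invented here, and commands such as ORDER that need their own jobs are not modelled.

```python
def to_map_reduce_jobs(plan):
    """Split a linear operator list into map-reduce jobs, one per Group/Cogroup."""
    jobs = []
    job = {"map": [], "group": None, "reduce": []}
    for op in plan:
        if op in ("Group", "Cogroup"):
            if job["group"] is not None:
                # A second grouping cannot share a job: close the current one.
                jobs.append(job)
                job = {"map": [], "group": None, "reduce": []}
            job["group"] = op
        elif job["group"] is None:
            job["map"].append(op)       # before the grouping: map side
        else:
            job["reduce"].append(op)    # after the grouping: reduce side
    jobs.append(job)
    return jobs


jobs = to_map_reduce_jobs(["Load", "Filter", "Group", "Foreach"])
two_jobs = to_map_reduce_jobs(["Load", "Group", "Foreach", "Group", "Foreach"])
```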
Compilation
- Efficiency in Execution
  - Parallelism
    - Loading data - files are loaded from HDFS
    - Statements are compiled into map-reduce jobs
Compilation
- Efficiency with Nested Bags
  - In many cases, the nested bags created in each tuple of a COGROUP statement never need to be physically materialized
  - Aggregation is generally performed right after a COGROUP, and the statements for that aggregation are pushed into the reduce function
  - Applies to algebraic functions (e.g., COUNT, MAX, MIN, SUM, AVG)
Compilation
- Efficiency with Nested Bags
[Figure: the plan Load(user.dat) → Filter → Group → Foreach, shown in three builds that highlight the Map, Combine, and Reduce phases in turn]
Compilation
- Efficiency with Nested Bags
  - Why this works:
    - COUNT is an algebraic function; it can be structured as a tree of sub-functions, with each leaf working on a subset of the data
[Figure: each Combine step computes a COUNT over its subset; the Reduce step SUMs the partial counts]
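The tree structure of an algebraic function is easy to see in Python. In this sketch the partitions and function names are made up; each combine step works on one subset, and the reduce step merges the partial results, so the full bag never has to exist in one place. AVG is included to show that it works the same way once the partial result is a (sum, count) pair.

```python
def combine_count(partition):
    """Leaf of the tree: COUNT over one subset of the data."""
    return len(partition)


def reduce_count(partial_counts):
    """Root of the tree: SUM of the partial counts."""
    return sum(partial_counts)


def combine_avg(partition):
    """Leaf for AVG: a (sum, count) pair instead of a single number."""
    return (sum(partition), len(partition))


def reduce_avg(partials):
    """Root for AVG: add up sums and counts, then divide once."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Made-up data, split the way map tasks might see it.
partitions = [[1, 2, 3], [4, 5], [6]]

partial_counts = [combine_count(p) for p in partitions]
total_count = reduce_count(partial_counts)

average = reduce_avg([combine_avg(p) for p in partitions])
```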
Compilation
- Efficiency with Nested Bags
  - Pig provides an interface for writing algebraic UDFs so they can take advantage of this optimization as well.
- Inefficiencies
  - Non-algebraic aggregate functions (e.g., MEDIAN) need the entire bag to materialize; a very large bag may spill to disk if it doesn't fit in memory
  - Every map-reduce job requires data to be written and replicated to HDFS (although this is offset by the parallelism achieved)
Debugging
- How to verify the semantics of an analysis program
  - Run the program against the whole data set. Might take hours!
  - Generate a sample dataset
    - An empty result set may occur on a few operations, like join and filter
    - Generally, testing with a sample dataset is difficult
- Pig-Pen
  - Samples data from the large dataset for Pig statements
  - Applies individual Pig-Latin commands against the dataset
  - In case of an empty result, the Pig system resamples
  - Removes redundant samples
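The resampling idea can be illustrated loosely in Python. This is not Pig-Pen's actual algorithm, which the slides do not spell out; it is a minimal sketch that assumes plain random sampling and a single FILTER-like predicate, with a made-up function name, parameters, and data. If the sample leaves the filter with nothing to show, another sample is drawn.

```python
import random


def sample_until_nonempty(data, predicate, size=3, max_tries=50, seed=0):
    """Draw small samples until the filtered result is non-empty (or tries run out)."""
    rng = random.Random(seed)
    sample = []
    for _ in range(max_tries):
        sample = rng.sample(data, size)
        kept = [row for row in sample if predicate(row)]
        if kept:
            return sample, kept
    return sample, []


# Made-up (url, pagerank) rows; one in five passes the filter.
data = [("url%d" % i, 0.9 if i % 5 == 0 else 0.1) for i in range(100)]

sample, kept = sample_until_nonempty(data, lambda row: row[1] > 0.8)
```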
Debugging
- Pig-Pen
  - Pig-Latin command window and command generator
  - Sand Box Dataset (generated automatically!)
[Screenshots of the Pig-Pen interface]
Debugging
- Pig-Pen
  - Provides sample data that is:
    - Real - taken from actual data
    - Concise - as small as possible
    - Complete - collectively illustrates the key semantics of each command
  - Helps with schema definition
  - Facilitates incremental program writing
Conclusion
- Pig is a data processing environment in Hadoop that is specifically targeted towards procedural programmers who perform large-scale data analysis.
- Pig-Latin offers high-level data manipulation in a procedural style.
- Pig-Pen is a debugging environment for Pig-Latin commands that generates samples from real data.
More Info
- Pig, http://hadoop.apache.org/pig/
- Hadoop, http://hadoop.apache.org
Anks-Thay!
48

More Related Content

What's hot

Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windowsMuhammad Shahid
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comsoftwarequery
 
Relational Algebra and MapReduce
Relational Algebra and MapReduceRelational Algebra and MapReduce
Relational Algebra and MapReducePietro Michiardi
 
MapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine LearningMapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine Learningbutest
 
Map Reduce
Map ReduceMap Reduce
Map Reduceschapht
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduceM Baddar
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXAndrea Iacono
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementKyong-Ha Lee
 

What's hot (20)

Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
Relational Algebra and MapReduce
Relational Algebra and MapReduceRelational Algebra and MapReduce
Relational Algebra and MapReduce
 
lec1_ref.pdf
lec1_ref.pdflec1_ref.pdf
lec1_ref.pdf
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
 
MapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine LearningMapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine Learning
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphX
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
MapReduce
MapReduceMapReduce
MapReduce
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
MapReduce basic
MapReduce basicMapReduce basic
MapReduce basic
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
 

Similar to Pig Latin Language for Data Processing

Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...DrPDShebaKeziaMalarc
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurSiddharth Mathur
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurSiddharth Mathur
 
ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON Padma shree. T
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopDilum Bandara
 
Scaling PostgreSQL With GridSQL
Scaling PostgreSQL With GridSQLScaling PostgreSQL With GridSQL
Scaling PostgreSQL With GridSQLJim Mlodgenski
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsDilum Bandara
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopApache Apex
 
Introduction to pig
Introduction to pigIntroduction to pig
Introduction to pigQiu Xiafei
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine ParallelismSri Prasanna
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
Mapfilterreducepresentation
MapfilterreducepresentationMapfilterreducepresentation
MapfilterreducepresentationManjuKumara GH
 
Phil Bartie QGIS PLPython
Phil Bartie QGIS PLPythonPhil Bartie QGIS PLPython
Phil Bartie QGIS PLPythonRoss McDonald
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Intel® Software
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packagesAjay Ohri
 

Similar to Pig Latin Language for Data Processing (20)

4.1-Pig.pptx
4.1-Pig.pptx4.1-Pig.pptx
4.1-Pig.pptx
 
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
 
pig.ppt
pig.pptpig.ppt
pig.ppt
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathur
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathur
 
ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON
 
Pig
PigPig
Pig
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with Hadoop
 
Scaling PostgreSQL With GridSQL
Scaling PostgreSQL With GridSQLScaling PostgreSQL With GridSQL
Scaling PostgreSQL With GridSQL
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel Problems
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Introduction to pig
Introduction to pigIntroduction to pig
Introduction to pig
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Flink internals web
Flink internals web Flink internals web
Flink internals web
 
Mapfilterreducepresentation
MapfilterreducepresentationMapfilterreducepresentation
Mapfilterreducepresentation
 
Phil Bartie QGIS PLPython
Phil Bartie QGIS PLPythonPhil Bartie QGIS PLPython
Phil Bartie QGIS PLPython
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packages
 

More from vishal choudhary (20)

SE-Lecture1.ppt
SE-Lecture1.pptSE-Lecture1.ppt
SE-Lecture1.ppt
 
SE-Testing.ppt
SE-Testing.pptSE-Testing.ppt
SE-Testing.ppt
 
SE-CyclomaticComplexityand Testing.ppt
SE-CyclomaticComplexityand Testing.pptSE-CyclomaticComplexityand Testing.ppt
SE-CyclomaticComplexityand Testing.ppt
 
SE-Lecture-7.pptx
SE-Lecture-7.pptxSE-Lecture-7.pptx
SE-Lecture-7.pptx
 
Se-Lecture-6.ppt
Se-Lecture-6.pptSe-Lecture-6.ppt
Se-Lecture-6.ppt
 
SE-Lecture-5.pptx
SE-Lecture-5.pptxSE-Lecture-5.pptx
SE-Lecture-5.pptx
 
XML.pptx
XML.pptxXML.pptx
XML.pptx
 
SE-Lecture-8.pptx
SE-Lecture-8.pptxSE-Lecture-8.pptx
SE-Lecture-8.pptx
 
SE-coupling and cohesion.ppt
SE-coupling and cohesion.pptSE-coupling and cohesion.ppt
SE-coupling and cohesion.ppt
 
SE-Lecture-2.pptx
SE-Lecture-2.pptxSE-Lecture-2.pptx
SE-Lecture-2.pptx
 
SE-software design.ppt
SE-software design.pptSE-software design.ppt
SE-software design.ppt
 
SE1.ppt
SE1.pptSE1.ppt
SE1.ppt
 
SE-Lecture-4.pptx
SE-Lecture-4.pptxSE-Lecture-4.pptx
SE-Lecture-4.pptx
 
SE-Lecture=3.pptx
SE-Lecture=3.pptxSE-Lecture=3.pptx
SE-Lecture=3.pptx
 
Multimedia-Lecture-Animation.pptx
Multimedia-Lecture-Animation.pptxMultimedia-Lecture-Animation.pptx
Multimedia-Lecture-Animation.pptx
 
MultimediaLecture5.pptx
MultimediaLecture5.pptxMultimediaLecture5.pptx
MultimediaLecture5.pptx
 
Multimedia-Lecture-7.pptx
Multimedia-Lecture-7.pptxMultimedia-Lecture-7.pptx
Multimedia-Lecture-7.pptx
 
MultiMedia-Lecture-4.pptx
MultiMedia-Lecture-4.pptxMultiMedia-Lecture-4.pptx
MultiMedia-Lecture-4.pptx
 
Multimedia-Lecture-6.pptx
Multimedia-Lecture-6.pptxMultimedia-Lecture-6.pptx
Multimedia-Lecture-6.pptx
 
Multimedia-Lecture-3.pptx
Multimedia-Lecture-3.pptxMultimedia-Lecture-3.pptx
Multimedia-Lecture-3.pptx
 

Recently uploaded

Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerunnathinaik
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxsocialsciencegdgrohi
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfadityarao40181
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...M56BOOKSTORE PRODUCT/SERVICE
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,Virag Sontakke
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...jaredbarbolino94
 

Recently uploaded (20)

Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdf
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...
 

Pig Latin Language for Data Processing

  • 1. Pig Latin: A Not-So-Foreign Language for Data Processing
  • 2. Motivation
      • You're a procedural programmer
      • You have huge data
      • You want to analyze it
  • 3. Motivation
      • As a procedural programmer…
          • May find writing queries in SQL unnatural and too restrictive
          • More comfortable writing code: a series of statements as opposed to one long query (e.g., the success of MapReduce)
  • 4. Motivation
      • Data analysis goals
          • Quick
              • Exploit the parallel processing power of a distributed system
          • Easy
              • Be able to write a program or query without a huge learning curve
              • Have some common analysis tasks predefined
          • Flexible
              • Transform one or more data sets into a workable structure without much overhead
              • Perform customized processing
          • Transparent
              • Have a say in how the data processing is executed on the system
  • 5. Motivation
      • Relational Distributed Databases
          • Parallel database products are expensive
          • Rigid schemas
          • Processing requires declarative SQL query construction
      • Map-Reduce
          • Relies on custom code for even common operations
          • Needs workarounds for tasks whose data flow differs from the expected Map → Combine → Reduce
  • 6. Motivation
      • Relational Distributed Databases
      • Sweet Spot: take the best of both SQL and Map-Reduce; combine high-level declarative querying with low-level procedural programming… Pig Latin!
      • Map-Reduce
  • 7. Pig Latin Example
      Table urls: (url, category, pagerank)
      Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category.
      SQL:
          SELECT category, AVG(pagerank)
          FROM urls WHERE pagerank > 0.2
          GROUP BY category HAVING COUNT(*) > 10^6
      Pig Latin:
          good_urls = FILTER urls BY pagerank > 0.2;
          groups = GROUP good_urls BY category;
          big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
          output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
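To make the statement-by-statement dataflow of the Pig Latin version concrete, here is a minimal Python sketch that mirrors the four statements on a tiny in-memory table. The sample rows, the function name, and the lowered group-size threshold (`min_size=1` instead of 10^6) are assumptions made for illustration; the sketch models the semantics only, not how Pig actually executes the script.

```python
from collections import defaultdict

# Hypothetical sample rows: (url, category, pagerank)
urls = [
    ("a.com", "news", 0.9),
    ("b.com", "news", 0.5),
    ("c.com", "news", 0.1),
    ("d.com", "sports", 0.8),
    ("e.com", "blogs", 0.3),
    ("f.com", "blogs", 0.7),
]

def avg_pagerank_by_big_category(rows, min_pagerank=0.2, min_size=1):
    # good_urls = FILTER urls BY pagerank > 0.2;
    good_urls = [r for r in rows if r[2] > min_pagerank]

    # groups = GROUP good_urls BY category;
    groups = defaultdict(list)
    for row in good_urls:
        groups[row[1]].append(row)

    # big_groups = FILTER groups BY COUNT(good_urls) > threshold;
    big_groups = {cat: bag for cat, bag in groups.items() if len(bag) > min_size}

    # output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
    return {cat: sum(r[2] for r in bag) / len(bag) for cat, bag in big_groups.items()}

result = avg_pagerank_by_big_category(urls)
```

Each intermediate name corresponds to one Pig Latin statement, which is the point of the slide: the query is written as a sequence of small, inspectable steps rather than one nested expression.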
  • 8. Outline
      • System Overview
      • Pig Latin (The Language)
          • Data Structures
          • Commands
      • Pig (The Compiler)
          • Logical & Physical Plans
          • Optimization
          • Efficiency
      • Pig Pen (The Debugger)
      • Conclusion
  • 10. Data Model
      • Atom - simple atomic value (e.g., a number or string)
      • Tuple
      • Bag
      • Map
  • 11. Data Model
      • Atom
      • Tuple - sequence of fields; each field can be of any type
      • Bag
      • Map
  • 12. Data Model
      • Atom
      • Tuple
      • Bag - collection of tuples
          • Duplicates possible
          • Tuples in a bag can have different numbers of fields and different field types
      • Map
  • 13. Data Model
      • Atom
      • Tuple
      • Bag
      • Map - collection of key-value pairs
          • Key is an atom; value can be of any type
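The four types above can be pictured with ordinary Python values. This is only an analogy, assuming a Python list of tuples stands in for a bag and a dict for a map; the sample values and the `kind` helper are hypothetical and not part of Pig.

```python
# Atom: a simple value such as a number or a string.
atom = "alice"

# Bag: a collection of tuples. Duplicates are allowed, and tuples
# in the same bag may have different lengths and field types.
bag = [("lakers", 1), ("lakers", 1), ("ipod",)]

# Map: atom keys mapped to values of any type (here an atom and a bag).
info = {"age": 20, "searches": bag}

# Tuple: a sequence of fields; fields may themselves be bags or maps,
# which is what "fully nested" means.
record = (atom, bag, info)

def kind(value):
    """Classify a Python value using the analogy described above."""
    if isinstance(value, dict):
        return "map"
    if isinstance(value, list) and all(isinstance(t, tuple) for t in value):
        return "bag"
    if isinstance(value, tuple):
        return "tuple"
    return "atom"
```

For example, `[kind(field) for field in record]` labels the three fields of `record` as an atom, a bag, and a map.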
  • 14. Data Model
      • Control over dataflow
          Ex 1 (less efficient):
              spam_urls = FILTER urls BY isSpam(url);
              culprit_urls = FILTER spam_urls BY pagerank > 0.8;
          Ex 2 (more efficient):
              highpgr_urls = FILTER urls BY pagerank > 0.8;
              spam_urls = FILTER highpgr_urls BY isSpam(url);
      • Fully nested
          • More natural for procedural programmers (the target users) than normalization
          • Data is often stored on disk in a nested fashion
          • Facilitates ease of writing user-defined functions
      • No schema required
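The difference between the two orderings can be sketched in Python by counting how often the UDF runs. The `is_spam` function below is a hypothetical stand-in for the slide's `isSpam`, and the assumption that it is the expensive step (compared with a numeric comparison on pagerank) is ours; the sample rows are made up.

```python
def make_counting_udf():
    """Return a stand-in spam check plus a counter of how often it ran."""
    calls = {"n": 0}

    def is_spam(url):  # hypothetical stand-in for the slide's isSpam UDF
        calls["n"] += 1
        return "spam" in url

    return is_spam, calls

# Hypothetical rows: (url, pagerank)
urls = [("spam-a.com", 0.9), ("spam-b.com", 0.1), ("ok-c.com", 0.95), ("ok-d.com", 0.2)]

def udf_first(rows):
    # Ex 1: run the UDF on every row, then apply the cheap comparison.
    is_spam, calls = make_counting_udf()
    spam_urls = [r for r in rows if is_spam(r[0])]
    culprit_urls = [r for r in spam_urls if r[1] > 0.8]
    return culprit_urls, calls["n"]

def cheap_filter_first(rows):
    # Ex 2: apply the cheap comparison first, then run the UDF on survivors.
    is_spam, calls = make_counting_udf()
    highpgr_urls = [r for r in rows if r[1] > 0.8]
    spam_urls = [r for r in highpgr_urls if is_spam(r[0])]
    return spam_urls, calls["n"]
```

Both functions return the same rows; only the number of UDF invocations differs, which is the kind of choice Pig Latin leaves to the programmer.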
  • 15. Data Model
      • User-Defined Functions (UDFs)
          • Can be used in many Pig Latin statements
          • Useful for custom processing tasks
          • Can use non-atomic values for input and output
          • Currently must be written in Java
          • Ex: spam_urls = FILTER urls BY isSpam(url);
  • 16. Speaking Pig Latin
      • LOAD
          • Input is assumed to be a bag (sequence of tuples)
          • Can specify a deserializer with "USING"
          • Can provide a schema with "AS"
          newBag = LOAD 'filename' <USING functionName()> <AS (fieldName1, fieldName2, …)>;
          Queries = LOAD 'query_log.txt' USING myLoad() AS (userID, queryString, timeStamp);
  • 17. Speaking Pig Latin
      • FOREACH
          • Apply some processing to each tuple in a bag
          • Each field can be:
              • A field name of the bag
              • A constant
              • A simple expression (e.g., f1+f2)
              • A predefined function (e.g., SUM, AVG, COUNT, FLATTEN)
              • A UDF (e.g., sumTaxes(gst, pst))
          newBag = FOREACH bagName GENERATE field1, field2, …;
  • 18. Speaking Pig Latin
      • FILTER
          • Select a subset of the tuples in a bag
              newBag = FILTER bagName BY expression;
          • Expression uses simple comparison operators (==, !=, <, >, …) and logical connectors (AND, NOT, OR)
              some_apples = FILTER apples BY colour != 'red';
          • Can use UDFs
              some_apples = FILTER apples BY NOT isRed(colour);
  • 19. Speaking Pig Latin
      • COGROUP
          • Group two datasets together by a common attribute
          • Groups data into nested bags
          grouped_data = COGROUP results BY queryString, revenue BY queryString;
  • 20. Speaking Pig Latin
      • Why COGROUP and not JOIN?
          url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRev(results, revenue));
  • 21. Speaking Pig Latin
      • Why COGROUP and not JOIN?
          • May want to process the nested bags of tuples before taking the cross product
          • Keeps to the goal of a single high-level data transformation per Pig Latin statement
          • However, the JOIN keyword is still available:
              JOIN results BY queryString, revenue BY queryString;
            is equivalent to:
              temp = COGROUP results BY queryString, revenue BY queryString;
              join_result = FOREACH temp GENERATE FLATTEN(results), FLATTEN(revenue);
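The equivalence between JOIN and COGROUP followed by FLATTEN can be checked with a small Python model. This is a sketch of the semantics only; the sample tuples and the function names are assumptions, with the join key taken to be the first field of each tuple.

```python
from collections import defaultdict
from itertools import product

# Hypothetical data: results(queryString, url, position), revenue(queryString, adSlot, amount)
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30), ("iphone", "top", 10)]

def cogroup(left, right):
    """Model of COGROUP: one (key, left_bag, right_bag) tuple per distinct key."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[t[0]][0].append(t)
    for t in right:
        groups[t[0]][1].append(t)
    return [(key, lbag, rbag) for key, (lbag, rbag) in groups.items()]

def join_via_cogroup(left, right):
    """Model of FOREACH ... GENERATE FLATTEN(left), FLATTEN(right):
    the cross product of the two nested bags within each group."""
    out = []
    for _key, lbag, rbag in cogroup(left, right):
        out.extend(l + r for l, r in product(lbag, rbag))
    return out

def direct_join(left, right):
    """Plain equi-join on the first field, for comparison."""
    return [l + r for l in left for r in right if l[0] == r[0]]
```

Note that the `cogroup` output keeps the "iphone" group with an empty left bag, so a function such as the slide's `distributeRev` could still inspect it; flattening an empty bag produces no rows, which is why that group disappears from the join.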
  • 22. Speaking Pig Latin
      • STORE (& DUMP)
          • Output data to a file (or to the screen)
          STORE bagName INTO 'filename' <USING serializer()>;
      • Other commands (incomplete list)
          • UNION - return the union of two or more bags
          • CROSS - take the cross product of two or more bags
          • ORDER - order tuples by one or more specified fields
          • DISTINCT - eliminate duplicate tuples in a bag
          • LIMIT - limit results to a subset
  • 23. Compilation
      • The Pig system does two tasks:
          • Builds a Logical Plan from a Pig Latin script
              • Supports execution platform independence
              • No processing of data is performed at this stage
          • Compiles the Logical Plan to a Physical Plan and executes it
              • Converts the Logical Plan into a series of Map-Reduce statements to be executed (in this case) by Hadoop Map-Reduce
  • 24. Compilation
      • Building a Logical Plan
          • Verify that the input files and bags referred to are valid
          • Create a logical plan for each bag (variable) defined
  • 25. Compilation
      • Building a Logical Plan: Example
          A = LOAD 'user.dat' AS (name, age, city);
          B = GROUP A BY city;
          C = FOREACH B GENERATE group AS city, COUNT(A);
          D = FILTER C BY city == 'kitchener' OR city == 'waterloo';
          STORE D INTO 'local_user_count.dat';
        Plan diagram: Load(user.dat)
  • 26. Compilation
      • Building a Logical Plan: Example (same script as slide 25)
        Plan diagram: Load(user.dat) → Group
  • 27. Compilation
      • Building a Logical Plan: Example (same script as slide 25)
        Plan diagram: Load(user.dat) → Group → Foreach
  • 28. Compilation
      • Building a Logical Plan: Example (same script as slide 25)
        Plan diagram: Load(user.dat) → Group → Foreach → Filter
  • 29. Compilation
      • Building a Logical Plan: Example (same script as slide 25)
        Plan diagram, with the Filter moved ahead of the Group: Load(user.dat) → Filter → Group → Foreach
  • 30. Compilation
      • Building a Physical Plan
          A = LOAD 'user.dat' AS (name, age, city);
          B = GROUP A BY city;
          C = FOREACH B GENERATE group AS city, COUNT(A);
          D = FILTER C BY city == 'kitchener' OR city == 'waterloo';
          STORE D INTO 'local_user_count.dat';
        Plan diagram: Load(user.dat) → Filter → Group → Foreach
      • Only happens when output is specified by STORE or DUMP
  • 31. Compilation
      • Building a Physical Plan
          • Step 1: Create a map-reduce job for each COGROUP
        Plan diagram (Map and Reduce stages): Load(user.dat) → Filter → Group → Foreach
  • 32. Compilation
      • Building a Physical Plan
          • Step 1: Create a map-reduce job for each COGROUP
          • Step 2: Push other commands into the map and reduce functions where possible
              • Certain commands may require their own map-reduce job (e.g., ORDER needs separate map-reduce jobs)
        Plan diagram (Map and Reduce stages): Load(user.dat) → Filter → Group → Foreach
  • 33. Compilation
      • Efficiency in Execution
          • Parallelism
              • Loading data - files are loaded from HDFS
              • Statements are compiled into map-reduce jobs
  • 34. Compilation
      • Efficiency with Nested Bags
          • In many cases, the nested bags created in each tuple of a COGROUP statement never need to be physically materialized
          • Aggregation is generally performed after a COGROUP, and the statements for that aggregation are pushed into the reduce function
          • Applies to algebraic functions (e.g., COUNT, MAX, MIN, SUM, AVG)
  • 35. Compilation
      • Efficiency with Nested Bags
        Plan diagram (Map stage): Load(user.dat) → Filter → Group → Foreach
  • 36. Compilation
      • Efficiency with Nested Bags
        Plan diagram (Combine stage): Load(user.dat) → Filter → Group → Foreach
  • 37. Compilation
      • Efficiency with Nested Bags
        Plan diagram (Reduce stage): Load(user.dat) → Filter → Group → Foreach
  • 38. Compilation
      • Efficiency with Nested Bags
          • Why this works:
              • COUNT is an algebraic function; it can be structured as a tree of sub-functions, with each leaf working on a subset of the data
        Diagram: Combine stage runs COUNT on each subset; Reduce stage applies SUM to the partial counts
  • 39. Compilation
      • Efficiency with Nested Bags
          • Pig provides an interface for writing algebraic UDFs so they can take advantage of this optimization as well
      • Inefficiencies
          • Non-algebraic aggregate functions (e.g., MEDIAN) need the entire bag to be materialized; a very large bag may spill to disk if it doesn't fit in memory
          • Every map-reduce job requires data to be written and replicated to HDFS (although this is offset by the parallelism achieved)
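The tree structure of an algebraic COUNT can be sketched in Python: each input split is counted on its own (the combine step), and the partial counts are then summed (the reduce step), so no complete per-city bag is ever built. The splits, names, and sample rows below are assumptions for illustration, and this models the idea rather than Hadoop's or Pig's actual interfaces.

```python
from collections import Counter
from statistics import median

# Hypothetical input splits of (name, age, city) tuples, one list per map task.
splits = [
    [("ann", 30, "waterloo"), ("bob", 25, "kitchener"), ("cat", 41, "waterloo")],
    [("dan", 19, "waterloo"), ("eve", 33, "toronto")],
]

def combine_count(split):
    """Combine step: partial COUNT per city within a single split."""
    return Counter(city for _name, _age, city in split)

def reduce_sum(partials):
    """Reduce step: SUM the partial counts for each city."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)

def count_by_city(all_splits):
    return reduce_sum(combine_count(s) for s in all_splits)

# MEDIAN is not algebraic: combining per-split medians gives the wrong answer,
# so the whole bag has to be brought together first.
age_splits = [[1, 2, 9], [3, 4]]
median_of_medians = median(median(s) for s in age_splits)
true_median = median(v for s in age_splits for v in s)
```

Here `count_by_city(splits)` matches a count over the concatenated data, whereas `median_of_medians` and `true_median` differ, which illustrates why non-algebraic aggregates force the bag to be materialized.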
  • 40. Debugging
      • How to verify the semantics of an analysis program
          • Run the program against the whole data set. Might take hours!
          • Generate a sample dataset
              • An empty result set may occur for some operations, such as join and filter
              • In general, testing with a sample dataset is difficult
      • Pig-Pen
          • Samples data from the large dataset for Pig statements
          • Applies individual Pig Latin commands against the sample dataset
          • In case of an empty result, the Pig system resamples
          • Removes redundant samples
  • 42. Debugging
      • Pig Latin command window and command generator
  • 43. Debugging
      • Sand Box Dataset (generated automatically!)
  • 44. Debugging
      • Pig-Pen
          • Provides sample data that is:
              • Real - taken from actual data
              • Concise - as small as possible
              • Complete - collectively illustrates the key semantics of each command
          • Helps with schema definition
          • Facilitates incremental program writing
  • 45. Conclusion
      • Pig is a data processing environment in Hadoop that is specifically targeted towards procedural programmers who perform large-scale data analysis.
      • Pig Latin offers high-level data manipulation in a procedural style.
      • Pig-Pen is a debugging environment for Pig Latin commands that generates samples from real data.
  • 46. More Info
      • Pig, http://hadoop.apache.org/pig/
      • Hadoop, http://hadoop.apache.org
      Anks- Thay!