Apache Pig
Prashant Gupta
PIG Latin
• Pig Latin is a data flow language used for exploring large data sets.
• Rapid development
• No Java is required.
• Pig is a high-level platform for creating MapReduce programs used
with Hadoop.
• Pig was originally developed at Yahoo Research around 2006 to give
researchers an ad-hoc way of creating and executing MapReduce
jobs on very large data sets. In 2007, it was moved into the
Apache Software Foundation.
• Like actual pigs, which eat almost anything, the Pig programming
language is designed to handle any kind of data, hence the name!
What Pig Does
Pig was designed for performing a long series of data operations,
making it ideal for three categories of Big Data jobs:
• Extract-transform-load (ETL) data pipelines,
• Research on raw data, and
• Iterative data processing.
Features of PIG
• Provides support for data types – long, float, chararray, schemas
and functions
• Is extensible and supports User Defined Functions
• Schema not mandatory, but used when available
• Provides common operations like JOIN, GROUP, FILTER, ORDER (sort)
When not to use PIG
• Really nasty data formats or completely unstructured data.
– Video Files
– Audio Files
– Image Files
– Raw human readable text
• Pig is typically slower than hand-written MapReduce code.
• When you need more control over optimization than Pig gives you.
PIG Use Case
PIG Components
Install PIG
• To install Pig:
• Untar the .gz file using tar -xvzf pig-0.13.0-bin.tar.gz
• To initialize the environment variables, export the following:
(Specifies the version of hadoop that is running)
• export HADOOP_HOME=/home/(user-name)/hadoop-0.20.2
(Specifies the installation directory of hadoop to the environment
variable HADOOP_HOME. Typically defined as /home/user-
(Specifies the class path for pig)
• export PATH=$PATH:/home/user-name/pig-0.13.0/bin
(for setting the PATH variable)
• export JAVA_HOME=/usr
(Specifies the java home to the environment variable.)
PIG Modes
• Pig in Local mode
– No HDFS is required. All files are read from and written to the local file system.
– Command: pig -x local
• Pig in MapReduce (Hadoop) mode
– To run Pig scripts in MapReduce mode, ensure you have access to
HDFS. By default, Pig starts in MapReduce mode.
– Command: pig -x mapreduce or pig
PIG Program Structure
• Grunt Shell or Interactive mode
– Grunt is an interactive shell for running PIG commands.
• PIG Scripts or Batch mode
– PIG can run a script file that contains PIG commands.
– E.g. pig script.pig
Introducing data types
• Data type is a data storage format that can contain a specific type or
range of values.
– Scalar types
• Sample: int, long, double, chararray, bytearray
– Complex types
• Sample: Tuple, Bag, Map
• User can declare data type at load time as below.
– A = LOAD '' USING PigStorage(',') AS (sno:chararray,
name:chararray, marks:long);
• If data type is not declared but the script treats a value as a certain type,
Pig will assume it is of that type and cast it.
– A = LOAD '' USING PigStorage(',') AS (sno, name, marks);
– B = FOREACH A GENERATE marks * 100; -- marks is cast to int to match the constant 100
Data types continues…
A relation can be defined as follows:
• A field/Atom is a piece of data.
Ex: 12.5 or hello world
• A tuple is an ordered set of fields.
Ex: Tuple (12.5,hello world,-2)
It’s most often used as a row in a relation.
It’s represented by fields separated by commas, enclosed by
parentheses.
• A bag is a collection of tuples.
Bag {(12.5,hello world,-2),(2.87,bye world,10)}
A bag is an unordered collection of tuples.
A bag is represented by tuples separated by commas, all
enclosed by curly brackets.
• Map [key#value]
A map is a set of key/value pairs.
Keys must be unique and be a string (chararray).
The value can be any type.
In short:
Relations, Bags, Tuples, Fields
Pig Latin statements work with relations. A relation can be defined as follows:
• A relation is a bag (more specifically, an outer bag).
• A bag is a collection of tuples.
• A tuple is an ordered set of fields.
• A field is a piece of data.
PIG Latin Statements
• A Pig Latin statement is an operator that takes a relation as input
and produces another relation as output.
• This definition applies to all Pig Latin operators except the LOAD and
STORE commands, which read data from and write data to the file
system.
• In Pig, when a data element is null it means it is unknown. Data of
any type can be null.
• Pig Latin statements can span multiple lines and must end with a
semi-colon ( ; )
PIG The programming language
• Pig Latin statements are generally organized in the following
manner:
– A LOAD statement reads data from the file system.
– A series of "transformation" statements process the data.
– A STORE statement writes output to the file system.
– A DUMP statement displays output to the screen.
• Because DUMP is a diagnostic tool, it will always trigger execution.
However, the STORE command is different.
• In interactive mode, STORE acts like DUMP and will always trigger
execution (this includes the run command), but in batch mode it will not
(this includes the exec command).
•The reason for this is efficiency. In batch mode, Pig will parse the
whole script to see whether there are any optimizations that could be
made to limit the amount of data to be written to or read from disk.
Consider the following simple example:
• A = LOAD 'input/pig/multiquery/A';
• B = FILTER A BY $1 == 'banana';
• C = FILTER A BY $1 != 'banana';
• STORE B INTO 'output/b';
• STORE C INTO 'output/c';
Relations B and C are both derived from A, so to save reading A twice,
Pig can run this script as a single MapReduce job by reading A once
and writing two output files from the job, one for each of B and C. This
feature is called multiquery execution.
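The single-pass idea behind multiquery execution can be modelled in plain Python. This is only a conceptual sketch, not how Pig implements it; the sample rows and the function name `split_in_one_pass` are made up for illustration.

```python
# Conceptual model of multiquery execution: scan input A once and
# feed both filters from the same pass. Sample rows are invented.
def split_in_one_pass(rows):
    b, c = [], []              # stand-ins for 'output/b' and 'output/c'
    for row in rows:           # A is read exactly once
        if row[1] is None:
            continue           # a null field satisfies neither filter
        if row[1] == 'banana':
            b.append(row)      # B = FILTER A BY $1 == 'banana'
        else:
            c.append(row)      # C = FILTER A BY $1 != 'banana'
    return b, c

A = [(1, 'banana'), (2, 'apple'), (3, 'banana'), (4, None)]
B, C = split_in_one_pass(A)
print(B)
print(C)
```

Without this optimization, each STORE would trigger its own read of A.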
Working with Data
File System Commands
Utility Commands
Logical vs. Physical Plan
When the Pig Latin interpreter sees the first line containing the LOAD
statement, it confirms that it is syntactically and semantically correct
and adds it to the logical plan, but it does not load the data from the file
(or even check whether the file exists).
The point is that it makes no sense to start any processing until the
whole flow is defined. Similarly, Pig validates the GROUP and
FOREACH…GENERATE statements, and adds them to the logical
plan without executing them. The trigger for Pig to start execution is the
DUMP statement. At that point, the logical plan is compiled into a
physical plan and executed.
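The deferred-execution behaviour described above can be illustrated with a toy Python class. It is a sketch under simplifying assumptions (the `Relation` class, its methods, and the sample rows are hypothetical and unrelated to Pig's real planner): statements are only recorded until `dump()` is called.

```python
# Toy model of lazy evaluation: each statement is appended to a plan,
# and the input is not read until dump() is called.
class Relation:
    def __init__(self, source, steps=()):
        self.source = source       # callable that returns rows when run
        self.steps = list(steps)   # recorded transformations (the "plan")

    def filter(self, predicate):
        return Relation(self.source, self.steps + [('filter', predicate)])

    def foreach(self, func):
        return Relation(self.source, self.steps + [('foreach', func)])

    def dump(self):
        rows = self.source()       # the data is read only at this point
        for kind, fn in self.steps:
            if kind == 'filter':
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows

loads = []                         # records each time the input is read

def load():
    loads.append('read')
    return [(1, 2), (3, 4), (5, 6)]

A = Relation(load)
B = A.filter(lambda r: r[0] > 1).foreach(lambda r: (r[0] * 10,))
print(len(loads))                  # 0: nothing has been read yet
print(B.dump())                    # [(30,), (50,)]
```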
Practice Session
Create a sample file
Save it as “student.txt”
Move it to HDFS by using below command.
hadoop fs -put <local path - filename> hdfspath
A = load 'student' using PigStorage(',') AS (name:chararray, age:int, gpa:float);
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}
store A into '/hdfspath';
GROUP groups the data in one relation.
B = GROUP A BY age;
Create Sample File
1 2 3
4 2 1
8 3 4
4 3 3
7 2 5
8 4 3
Move it to HDFS by using below command.
hadoop fs -put <localpath> <hdfspath>
Create another Sample File
2 4
8 9
1 3
2 7
2 9
4 6
4 9
Move it to HDFS by using below command.
hadoop fs -put localpath hdfspath
Definition: Selects tuples from a relation based on some condition.
FILTER is commonly used to select the data that you want; or, conversely, to filter out
(remove) the data you don’t want.
A = LOAD 'data' using PigStorage(',') AS (a1:int,a2:int,a3:int);
X = FILTER A BY a3 == 3;
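To see what this FILTER keeps, here is a plain-Python equivalent applied to the first sample file from the practice session. It only models the semantics; it does not run Pig.

```python
# Rows of the first sample file, as (a1, a2, a3) tuples.
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

# X = FILTER A BY a3 == 3;
X = [row for row in A if row[2] == 3]
print(X)  # [(1, 2, 3), (4, 3, 3), (8, 4, 3)]
```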
Definition: The GROUP and COGROUP operators are identical. For
readability GROUP is used in statements involving one relation and
COGROUP is used in statements involving two or more relations.
X = COGROUP A BY $0, B BY $0;
(1, {(1, 2, 3)}, {(1, 3)})
(2, {}, {(2, 4), (2, 7), (2, 9)})
(4, {(4, 2, 1), (4, 3, 3)}, {(4, 6),(4, 9)})
(7, {(7, 2, 5)}, {})
(8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})
• To see only the groups for which both inputs have at least one tuple, use INNER:
X = COGROUP A BY $0 INNER, B BY $0 INNER;
(1, {(1, 2, 3)}, {(1, 3)})
(4, {(4, 2, 1), (4, 3, 3)}, {(4, 6), (4, 9)})
(8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})
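The COGROUP results above can be reproduced with a small Python model. The `cogroup` function and its `inner` flag are illustrative names, and Python lists stand in for bags (real bags are unordered).

```python
# Model of COGROUP on the first field: one output tuple per key,
# holding the key plus one bag of matching tuples per input relation.
def cogroup(a, b, inner=False):
    keys = sorted({r[0] for r in a} | {r[0] for r in b})
    out = []
    for k in keys:
        bag_a = [r for r in a if r[0] == k]
        bag_b = [r for r in b if r[0] == k]
        if inner and (not bag_a or not bag_b):
            continue               # INNER on both inputs drops empty bags
        out.append((k, bag_a, bag_b))
    return out

A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
B = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

for group in cogroup(A, B):
    print(group)
print([g[0] for g in cogroup(A, B, inner=True)])  # [1, 4, 8]
```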
Flatten Operator
• Flatten un-nests tuples as well as bags.
• For tuples, flatten substitutes the fields of a tuple in place of the tuple.
• For example, consider a relation (a, (b, c)).
• GENERATE $0, flatten($1)
– (a, b, c).
• For bags, flatten substitutes bags with new tuples.
• For Example, consider a bag ({(b,c),(d,e)}).
• GENERATE flatten($0),
– will end up with two tuples (b,c) and (d,e).
• When we remove a level of nesting in a bag, sometimes we cause a cross product to
happen.
• For example, consider a relation (a, {(b,c), (d,e)})
• GENERATE $0, flatten($1),
– it will create new tuples: (a, b, c) and (a, d, e).
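The three flatten cases above can be mimicked in Python. The helper names `flatten_tuple` and `flatten_bag` are hypothetical, and lists stand in for bags.

```python
def flatten_tuple(row, i):
    # Replace the nested tuple at position i with its fields:
    # (a, (b, c)) -> (a, b, c)
    return row[:i] + tuple(row[i]) + row[i + 1:]

def flatten_bag(row, i):
    # Emit one output tuple per tuple in the bag at position i:
    # (a, {(b,c),(d,e)}) -> (a, b, c), (a, d, e)
    return [row[:i] + tuple(t) + row[i + 1:] for t in row[i]]

print(flatten_tuple(('a', ('b', 'c')), 1))
print(flatten_bag(([('b', 'c'), ('d', 'e')],), 0))
print(flatten_bag(('a', [('b', 'c'), ('d', 'e')]), 1))
```

Note that flattening an empty bag yields no output tuples at all, which matters for the null-handling discussion in the performance slides.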
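Definition: Performs an inner join of two or more relations based on common field values.
X = JOIN A BY $0, B BY $0;
which is equivalent to:
X1 = COGROUP A BY $0 INNER, B BY $0 INNER;
X = FOREACH X1 GENERATE FLATTEN(A), FLATTEN(B);
The result is: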
(1, 2, 3, 1, 3)
(4, 2, 1, 4, 6)
(4, 3, 3, 4, 6)
(4, 2, 1, 4, 9)
(4, 3, 3, 4, 9)
(8, 3, 4, 8, 9)
(8, 4, 3, 8, 9)
The intermediate COGROUP result (before FLATTEN) is:
(1, {(1, 2, 3)}, {(1, 3)})
(4, {(4, 2, 1), (4, 3, 3)}, {(4, 6), (4, 9)})
(8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})
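An inner join can be thought of as an inner cogroup followed by flattening both bags, which is the rewrite described later in the performance-tuning slides. The Python sketch below follows that idea on the two sample files; `join_via_cogroup` is an illustrative name, and the row order Pig prints may differ.

```python
from itertools import product

def join_via_cogroup(a, b):
    # Keys present in both inputs (COGROUP ... INNER on both sides).
    keys = sorted({r[0] for r in a} & {r[0] for r in b})
    out = []
    for k in keys:
        bag_a = [r for r in a if r[0] == k]
        bag_b = [r for r in b if r[0] == k]
        # Flattening both bags yields their cross product per key.
        out.extend(ra + rb for ra, rb in product(bag_a, bag_b))
    return out

A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
B = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]
X = join_via_cogroup(A, B)
print(len(X))  # 7
for row in X:
    print(row)
```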
DISTINCT removes duplicate tuples in a relation.
• CROSS computes the cross product of two or more relations.
Example: X = CROSS A, B;
(1, 2, 3, 2, 4)
(1, 2, 3, 8, 9)
(1, 2, 3, 1, 3)
(1, 2, 3, 2, 7)
(1, 2, 3, 2, 9)
(1, 2, 3, 4, 6)
(1, 2, 3, 4, 9)
(4, 2, 1, 2, 4)
(4, 2, 1, 8, 9)
SPLIT partitions a relation into two or more relations.
Example: A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
DUMP Z;
(1,2,3)
(7,8,9)
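A Python model of the same SPLIT makes all three outputs visible. It shows that a tuple may land in more than one output relation. This is a sketch of the semantics only.

```python
A = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
X = [r for r in A if r[0] < 7]
Y = [r for r in A if r[1] == 5]
Z = [r for r in A if r[2] < 6 or r[2] > 6]

print(X)  # [(1, 2, 3), (4, 5, 6)]
print(Y)  # [(4, 5, 6)]
print(Z)  # [(1, 2, 3), (7, 8, 9)]
```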
Some more commands
• To select a few columns from one dataset
– S1 = foreach A generate a1, a2;
• Simple calculation on dataset
– K = foreach A generate $1, $2, $1*$2;
• To keep only 100 records
– B = limit A 100;
• To see the structure/schema
– DESCRIBE A;
• To union two datasets
– C = UNION A, B;
Word Count Program
Create a basic wordsample.txt file and move it to HDFS.
x = load '/home/pgupta5/prashant/data.txt';
y = foreach x generate flatten(TOKENIZE((chararray) $0))
as word;
z = group y by word;
counter = foreach z generate group, COUNT(y);
store counter into '/NewPigData/WordCount';
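The same pipeline can be traced step by step in Python. The input lines are invented, and `str.split()` only approximates TOKENIZE, which also splits on some punctuation.

```python
from collections import Counter

# Stand-in for the loaded file: one string per input line.
x = ["hello world", "hello pig"]

# y: TOKENIZE each line, then FLATTEN the bag of words into rows.
y = [word for line in x for word in line.split()]

# z + counter: GROUP BY word, then COUNT the tuples in each group.
counter = Counter(y)

for word, count in sorted(counter.items()):
    print(word, count)
```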
Another Example
i/p: webcount
en 70 2012
en 60 2013
us 80 2012
en 40 2014
us 80 2012
records = LOAD 'webcount' using PigStorage('\t') as (country:chararray,
name:chararray, pagecount:int, year:int);
filtered_records = filter records by country == 'en';
grouped_records = group filtered_records by name;
results = foreach grouped_records generate group, SUM(filtered_records.pagecount);
sorted_result = order results by $1 desc;
store sorted_result into '/some_external_HDFS_location//data'; -- Hive external table path
Find Maximum Score
i/p: CricketScore.txt
a = load '/user/cloudera/SampleDataFile/CricketScore.txt'
using PigStorage('\t');
b = foreach a generate $0, $1;
c = group b by $0;
d = foreach c generate group, MAX(b.$1);
dump d;
Sorting Data
Relations are unordered in Pig.
Consider a relation A:
• grunt> DUMP A;
• (2,3)
• (1,2)
• (2,4)
There is no guarantee which order the rows will be processed in. In particular, when
retrieving the contents of A using DUMP or STORE, the rows may be written in any
order. If you want to impose an order on the output, you can use the ORDER
operator to sort a relation by one or more fields.
The following example sorts A by the first field in ascending order and by the
second field in descending order:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
• (1,2)
• (2,4)
• (2,3)
Any further processing on a sorted relation is not guaranteed to retain its order.
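The mixed ascending/descending sort can be checked with a short Python equivalent. Negating the second field is a trick that works here only because the field is numeric.

```python
A = [(2, 3), (1, 2), (2, 4)]

# B = ORDER A BY $0, $1 DESC;
B = sorted(A, key=lambda row: (row[0], -row[1]))
print(B)  # [(1, 2), (2, 4), (2, 3)]
```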
Using Hive tables with HCatalog
• HCatalog (which is a component of Hive) provides
access to Hive’s metastore, so that Pig queries can
reference schemas by name rather than specifying them in full each time.
• For example, after data has been loaded into a Hive table, Pig can access the
table’s schema and data as follows:
• pig -useHCatalog
• grunt> records = LOAD 'School_db.student_tbl'
USING org.apache.hcatalog.pig.HCatLoader();
• grunt> DESCRIBE records;
• grunt> DUMP records;
Pig provides extensive support for user defined functions (UDFs) to
specify custom processing.
REGISTER - Registers the JAR file with the Pig runtime.
REGISTER myudfs.jar;
-- The JAR file should be available on the local Linux file system.
A = LOAD 'student_data' using PigStorage(',') AS (name: chararray,
age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
UDF Sample Program
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
// Eval function that upper-cases its first argument.
public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
• (Pig’s Java UDF extends functionalities of EvalFunc)
Diagnostic operator
• DESCRIBE: Prints a relation’s schema.
• EXPLAIN: Prints the logical and physical plans.
• ILLUSTRATE: Shows a sample execution of the logical
plan, using a generated subset of the input.
Performance Tuning
Pig does not (yet) determine when a field is no longer needed and drop the field from the
row. For example, say you have a query like:
• Project Early and Often
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C = join A by t, B by x;
– D = group C by u;
– E = foreach D generate group, COUNT($1);
• There is no need for v, y, or z to participate in this query. And there is no need to
carry both t and x past the join, just one will suffice. Changing the query above to the
query below will greatly reduce the amount of data being carried through the map and
reduce phases by pig.
– A = load 'myfile' as (t, u, v);
– A1 = foreach A generate t, u;
– B = load 'myotherfile' as (x, y, z);
– B1 = foreach B generate x;
– C = join A1 by t, B1 by x;
– C1 = foreach C generate t, u;
– D = group C1 by u;
– E = foreach D generate group, COUNT($1);
Performance Tuning
As with early projection, in most cases it is beneficial to apply filters as early as possible
to reduce the amount of data flowing through the pipeline.
• Filter Early and Often
-- Query 1
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C = filter A by t == 1;
– D = join C by t, B by x;
– E = group D by u;
– F = foreach E generate group, COUNT($1);
-- Query 2
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C = join A by t, B by x;
– D = group C by u;
– E = foreach D generate group, COUNT($1);
– F = filter E by C.t == 1;
• The first query is clearly more efficient than the second one because
it reduces the amount of data going into the join.
Performance Tuning
Often you are not interested in the entire output but rather a
sample or top results. In such cases, LIMIT can yield a
much better performance as we push the limit as high as
possible to minimize the amount of data travelling through
the pipeline.
• Use the LIMIT Operator
– A = load 'myfile' as (t, u, v);
– B = order A by t;
– C = limit B 500;
Performance Tuning
If types are not specified in the load statement, Pig assumes the
type of double for numeric computations. A lot of the time, your
data would be much smaller, maybe, integer or long. Specifying
the real type will help with speed of arithmetic computation.
• Use Types
– --Query 1
• A = load 'myfile' as (t, u, v);
• B = foreach A generate t + u;
– --Query 2
• A = load 'myfile' as (t: int, u: int, v);
• B = foreach A generate t + u;
• The second query will run more efficiently than the first. In
some of our queries we see a 2x speedup.
Performance Tuning
• Use Joins appropriately.
– Understand Skewed Vs. Replicated vs. Merge join.
• Remove null values before join.
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C = join A by t, B by x;
• is rewritten by Pig to
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C1 = cogroup A by t INNER, B by x INNER;
– C = foreach C1 generate flatten(A), flatten(B);
Since the nulls from A and B won't be collected together,
when the nulls are flattened we're guaranteed to have an
empty bag, which will result in no output. But they will not
be dropped until the last possible moment.
Performance Tuning
• Hence the previous query should be rewritten as
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– A1 = filter A by t is not null;
– B1 = filter B by x is not null;
– C = join A1 by t, B1 by x;
Now nulls will be dropped before the join. Since all null
keys go to a single reducer, if your key is null even a small
percentage of the time the gain can be significant.
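The effect of filtering null keys before the join can be sketched in Python, with `None` standing in for null. The sample rows are invented. The joined output is the same either way, because null keys never match; the filter only shrinks what has to be shuffled.

```python
A = [(1, 'a'), (None, 'b'), (2, 'c'), (None, 'd')]
B = [(1, 'x'), (None, 'y'), (2, 'z')]

# A1 = filter A by t is not null;  B1 = filter B by x is not null;
A1 = [r for r in A if r[0] is not None]
B1 = [r for r in B if r[0] is not None]

# C = join A1 by t, B1 by x;
C = [ra + rb for ra in A1 for rb in B1 if ra[0] == rb[0]]

print(len(A) - len(A1), len(B) - len(B1))  # rows dropped before the join
print(C)  # [(1, 'a', 1, 'x'), (2, 'c', 2, 'z')]
```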
Performance Tuning
• You can set the number of reduce tasks for the
MapReduce jobs generated by Pig using the parallel
reducer feature.
– The SET default_parallel command is used at the script
level.
• In this example all the MapReduce jobs that get launched use 20
reducers.
– SET default_parallel 20;
– A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
– B = GROUP A BY t;
– C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
– D = ORDER C BY mycount;
– The PARALLEL clause can be used with any operator that starts a
reduce phase, such as group, cogroup, join, order by, and distinct.
Replicated Join
• One of the datasets is small enough that it fits in the memory.
• A replicated join copies the small dataset to the distributed cache -
space that is available on every cluster machine - and loads it into
the memory.
• Because the data is available in memory (via the distributed cache) and is processed on
the map side of MapReduce, this operation works much faster than
a default join.
• Limitations
It isn’t clear how small the dataset needs to be for using replicated join.
According to the Pig documentation, a relation of up to 100 MB can
be used when the process has 1 GB of memory. A run-time error will
be generated if not enough memory is available for loading the data.
• transactions = load 'customer_transactions' as ( fname, lname, city,
state, country, amount, tax);
• geography = load 'geo_data' as (state, country, district, manager);
Regular join
• sales = join transactions by (state, country), geography by (state,
country);
Replicated join
• sales = join transactions by (state, country), geography by (state,
country) using 'replicated';
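The idea behind a replicated join is a map-side hash join: hold the small relation in memory and stream the large one past it. The Python sketch below models that idea only; `replicated_join`, the key functions, and the sample rows are all hypothetical.

```python
def replicated_join(big_rows, small_rows, big_key, small_key):
    # "Replicate": build an in-memory lookup table from the small relation.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(small_key(row), []).append(row)
    # Map side: stream the large relation and probe the table.
    # No shuffle or reduce phase is needed.
    for row in big_rows:
        for match in lookup.get(big_key(row), []):
            yield row + match

# Invented sample data using the column layout from the slides.
transactions = [
    ('Ann', 'Lee', 'Pune', 'MH', 'IN', 100, 5),
    ('Bob', 'Ray', 'Austin', 'TX', 'US', 200, 10),
    ('Cy', 'Poe', 'Oslo', 'OS', 'NO', 50, 2),
]
geography = [('MH', 'IN', 'West', 'Asha'), ('TX', 'US', 'South', 'Dan')]

sales = list(replicated_join(
    transactions, geography,
    big_key=lambda t: (t[3], t[4]),     # (state, country)
    small_key=lambda g: (g[0], g[1]),   # (state, country)
))
for row in sales:
    print(row)
```

As in the Pig statement, the small relation (geography) is the one held in memory, which is why its size is the limiting factor.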
Skewed Join
• One of the keys is much more common than others, and the data for
it is too large to fit in the memory.
• Standard joins run in parallel across different reducers by splitting
key values across processes. If there is a lot of data for a certain
key, the data will not be distributed evenly across the reducers, and
one of them will be ‘stuck’ processing the majority of data.
• Skewed join handles this case. It calculates a histogram to check
which key is the most prevalent and then splits its data across
different reducers for optimal performance.
• transactions = load 'customer_transactions' as ( fname,
lname, city, state, country, amount, tax);
• geography = load 'geo_data' as (state, country, district,
manager);
• sales = join transactions by (state, country), geography
by (state, country) using 'skewed';
Merge Join
• The two datasets are both sorted in ascending order by the join key.
• Datasets may already be sorted by the join key if that’s the order in
which data was entered or they have undergone sorting before the
join operation for other needs.
• When merge join receives the pre-sorted datasets, they are read
and compared on the map side, and as a result they run faster. Both
inner and outer join are available.
• transactions = load 'customer_transactions' as (
fname, lname, city, state, country, amount, tax);
• geography = load 'geo_data' as (state, country,
district, manager);
• sales = join transactions by (state, country),
geography by (state, country) using 'merge';
Thank You
• Question?
• Feedback?

The affect of service quality and online reviews on customer loyalty in the E...
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...

Apache PIG

  • 2. PIG Latin • Pig Latin is a data flow language used for exploring large data sets. • Rapid development • No Java is required. • It is a high-level platform for creating MapReduce programs used with Hadoop. • Pig was originally developed at Yahoo Research around 2006 to give researchers an ad hoc way of creating and executing MapReduce jobs on very large data sets. In 2007, it was moved into the Apache Software Foundation. • Like actual pigs, which eat almost anything, the Pig programming language is designed to handle any kind of data, hence the name!
  • 3. What Pig Does Pig was designed for performing a long series of data operations, making it ideal for three categories of Big Data jobs: • Extract-transform-load (ETL) data pipelines, • Research on raw data, and • Iterative data processing.
  • 4. Features of PIG • Provides support for data types – long, float, chararray, schemas and functions • Is extensible and supports User Defined Functions • Schema not mandatory, but used when available • Provides common operations like JOIN, GROUP, FILTER, SORT
  • 5. When not to use PIG • Really nasty data formats or completely unstructured data. – Video Files – Audio Files – Image Files – Raw human readable text • Pig jobs are typically slower than equivalent hand-written MapReduce code. • When you need more power to optimize code.
  • 8. Install PIG
    • To install Pig, untar the .gz file: tar -xvzf pig-0.13.0-bin.tar.gz
    • To initialize the environment variables, export the following:
      export PIG_HADOOP_VERSION=20   (the version of Hadoop that is running)
      export HADOOP_HOME=/home/(user-name)/hadoop-0.20.2   (the Hadoop installation directory, typically /home/user-name/hadoop-version)
      export PIG_CLASSPATH=$HADOOP_HOME/conf   (the class path for Pig)
      export PATH=$PATH:/home/user-name/pig-0.13.0/bin   (adds the Pig binaries to PATH)
      export JAVA_HOME=/usr   (the Java home directory)
  • 9. PIG Modes • Pig in Local mode – No HDFS is required; all files are read from and written to the local file system. – Command: pig -x local • Pig in MapReduce (Hadoop) mode – To run Pig scripts in MapReduce mode, ensure you have access to HDFS. By default, Pig starts in MapReduce mode. – Command: pig -x mapreduce or pig
  • 10. PIG Program Structure • Grunt Shell or Interactive mode – Grunt is an interactive shell for running Pig commands. • PIG Scripts or Batch mode – Pig can run a script file that contains Pig commands. – E.g. pig script.pig
  • 11. Introducing data types • A data type is a data storage format that can contain a specific type or range of values. – Scalar types (atoms) • Sample: int, long, double, chararray, bytearray – Complex types • Sample: Tuple, Bag, Map
  • 12. • The user can declare data types at load time as below. – A = LOAD '' USING PigStorage(',') AS (sno:chararray, name:chararray, marks:long); • If a data type is not declared but the script treats a value as a certain type, Pig will assume it is of that type and cast it. – A = LOAD '' USING PigStorage(',') AS (sno, name, marks); – B = FOREACH A GENERATE marks * 100; -- marks is cast to int, because 100 is an int constant
  • 14. Data types continued… The building blocks of a relation can be defined as follows: • A field (atom) is a piece of data. Ex: 12.5 or hello world • A tuple is an ordered set of fields. Ex: (12.5,hello world,-2) It is most often used as a row in a relation. It is represented by fields separated by commas, enclosed by parentheses.
  • 15. • A bag is an unordered collection of tuples. Ex: {(12.5,hello world,-2),(2.87,bye world,10)} A bag is represented by tuples separated by commas, all enclosed by curly braces. • A map is a set of key/value pairs. Ex: [key#value] Keys must be unique and must be strings (chararray). The value can be any type.
  • 16. In short: Relations, Bags, Tuples, Fields Pig Latin statements work with relations. A relation can be defined as follows: • A relation is a bag (more specifically, an outer bag). • A bag is a collection of tuples. • A tuple is an ordered set of fields. • A field is a piece of data.
  • 17. PIG Latin Statements • A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. • This definition applies to all Pig Latin operators except the LOAD and STORE commands, which read data from and write data to the file system. • In Pig, a null data element means that its value is unknown. Data of any type can be null. • Pig Latin statements can span multiple lines and must end with a semicolon ( ; ).
  • 18. PIG The programming language • Pig Latin statements are generally organized in the following manner: – A LOAD statement reads data from the file system. – A series of "transformation" statements process the data. – A STORE statement writes output to the file system; OR – A DUMP statement displays output to the screen
  • 19. MULTIQUERY EXECUTION •Because DUMP is a diagnostic tool, it will always trigger execution. However, the STORE command is different. • In interactive mode, STORE acts like DUMP and will always trigger execution (this includes the run command), but in batch mode it will not (this includes the exec command). •The reason for this is efficiency. In batch mode, Pig will parse the whole script to see whether there are any optimizations that could be made to limit the amount of data to be written to or read from disk.
  • 20. Consider the following simple example: • A = LOAD 'input/pig/multiquery/A'; • B = FILTER A BY $1 == 'banana'; • C = FILTER A BY $1 != 'banana'; • STORE B INTO 'output/b'; • STORE C INTO 'output/c'; Relations B and C are both derived from A, so to save reading A twice, Pig can run this script as a single MapReduce job by reading A once and writing two output files from the job, one for each of B and C. This feature is called multiquery execution.
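The idea behind multiquery execution can be sketched in plain Python (this is a model of the behaviour, not Pig code): the input is scanned once and each row is routed to whichever output it belongs to. The sample rows and the function name `run_multiquery` are made up for illustration.

```python
# Plain-Python model of multiquery execution (not Pig code).
# The rows below are made-up sample data.
rows = [("1", "banana"), ("2", "apple"), ("3", "banana"), ("4", "cherry")]

def run_multiquery(records):
    """Scan the input once and route each row to one of two outputs."""
    output_b, output_c = [], []
    rows_read = 0
    for row in records:              # single pass over A
        rows_read += 1
        if row[1] is None:           # a null field satisfies neither filter
            continue
        if row[1] == "banana":       # B = FILTER A BY $1 == 'banana'
            output_b.append(row)
        else:                        # C = FILTER A BY $1 != 'banana'
            output_c.append(row)
    return output_b, output_c, rows_read

b, c, rows_read = run_multiquery(rows)
print(b)          # rows whose second field is 'banana'
print(c)          # all other rows with a non-null second field
print(rows_read)  # each input row is read once, not once per output
```

Without this optimization, each STORE would trigger its own scan of A, so the input would be read twice.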
  • 24. Logical vs. Physical Plan When the Pig Latin interpreter sees the first line containing the LOAD statement, it confirms that it is syntactically and semantically correct and adds it to the logical plan, but it does not load the data from the file (or even check whether the file exists). The point is that it makes no sense to start any processing until the whole flow is defined. Similarly, Pig validates the GROUP and FOREACH…GENERATE statements, and adds them to the logical plan without executing them. The trigger for Pig to start execution is the DUMP statement. At that point, the logical plan is compiled into a physical plan and executed.
  • 26. Create a sample file
    John,18,4.0
    Mary,19,3.8
    Bill,20,3.9
    Joe,18,3.8
    Save it as "student.txt" and move it to HDFS using the command below.
    hadoop fs -put <localpath/filename> <hdfspath>
  • 27. LOAD/DUMP/STORE
    A = LOAD 'student' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
    DESCRIBE A;
    A: {name: chararray,age: int,gpa: float}
    DUMP A;
    (John,18,4.0) (Mary,19,3.8) (Bill,20,3.9) (Joe,18,3.8)
    STORE A INTO '/hdfspath';
  • 28. Group Groups the data in a single relation.
    B = GROUP A BY age;
    DUMP B;
    (18,{(John,18,4.0),(Joe,18,3.8)}) (19,{(Mary,19,3.8)}) (20,{(Bill,20,3.9)})
  • 29. Foreach…Generate
    C = FOREACH B GENERATE group, COUNT(A);
    DUMP C;
    (18,2) (19,1) (20,1)
    C = FOREACH B GENERATE $0, $1.name;
    DUMP C;
    (18,{(John),(Joe)}) (19,{(Mary)}) (20,{(Bill)})
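As a rough illustration of what GROUP followed by FOREACH…GENERATE computes, here is a plain-Python model (not Pig code) over the student rows from the sample file. The helper name `group_by` is an assumption introduced for this sketch.

```python
from collections import defaultdict

# Rows from the student sample file: (name, age, gpa)
A = [("John", 18, 4.0), ("Mary", 19, 3.8), ("Bill", 20, 3.9), ("Joe", 18, 3.8)]

def group_by(relation, key_index):
    """Model of GROUP ... BY: map each key to the bag of full tuples that carry it."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    return dict(groups)

B = group_by(A, 1)                                        # B = GROUP A BY age;
counts = sorted((age, len(bag)) for age, bag in B.items())  # GENERATE group, COUNT(A)
names = {age: [(row[0],) for row in bag] for age, bag in B.items()}  # group, A.name
print(counts)
print(names)
```

Each group keeps the complete input tuples in its bag, which is why a later FOREACH can either count them or project a single field out of them.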
  • 30. Create Sample File FileA.txt
    1 2 3
    4 2 1
    8 3 4
    4 3 3
    7 2 5
    8 4 3
    Move it to HDFS using the command below.
    hadoop fs -put <localpath> <hdfspath>
  • 31. Create another Sample File FileB.txt
    2 4
    8 9
    1 3
    2 7
    2 9
    4 6
    4 9
    Move it to HDFS using the command below.
    hadoop fs -put <localpath> <hdfspath>
  • 32. Filter Definition: Selects tuples from a relation based on some condition. FILTER is commonly used to select the data that you want or, conversely, to filter out (remove) the data you don't want.
    Examples
    A = LOAD 'data' USING PigStorage(',') AS (a1:int, a2:int, a3:int);
    DUMP A;
    (1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
    X = FILTER A BY a3 == 3;
    DUMP X;
    (1,2,3) (4,3,3) (8,4,3)
  • 33. Co-Group Definition: The GROUP and COGROUP operators are identical. For readability, GROUP is used in statements involving one relation and COGROUP is used in statements involving two or more relations.
    Input A (FileA.txt): (1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
    Input B (FileB.txt): (2,4) (8,9) (1,3) (2,7) (2,9) (4,6) (4,9)
    X = COGROUP A BY $0, B BY $0;
    (1, {(1, 2, 3)}, {(1, 3)})
    (2, {}, {(2, 4), (2, 7), (2, 9)})
    (4, {(4, 2, 1), (4, 3, 3)}, {(4, 6), (4, 9)})
    (7, {(7, 2, 5)}, {})
    (8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})
    • To see only the groups for which both inputs have at least one tuple:
    X = COGROUP A BY $0 INNER, B BY $0 INNER;
    (1, {(1, 2, 3)}, {(1, 3)})
    (4, {(4, 2, 1), (4, 3, 3)}, {(4, 6), (4, 9)})
    (8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})
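The COGROUP results above can be reproduced with a small plain-Python model (not Pig code). The function name `cogroup` and its `inner` flag are assumptions for this sketch, and null keys are not handled.

```python
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]  # FileA.txt
B = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]            # FileB.txt

def cogroup(left, right, inner=False):
    """Model of COGROUP left BY $0, right BY $0 (null keys are not handled)."""
    keys = sorted({row[0] for row in left} | {row[0] for row in right})
    result = []
    for key in keys:
        left_bag = [row for row in left if row[0] == key]
        right_bag = [row for row in right if row[0] == key]
        if inner and (not left_bag or not right_bag):
            continue  # INNER on both inputs drops groups that have an empty bag
        result.append((key, left_bag, right_bag))
    return result

for group in cogroup(A, B):
    print(group)
```

Calling `cogroup(A, B, inner=True)` keeps only the keys present in both inputs, which corresponds to the second COGROUP statement on the slide.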
  • 34. Flatten Operator • Flatten un-nests tuples as well as bags. • For tuples, flatten substitutes the fields of a tuple in place of the tuple. • For example, consider a relation (a, (b, c)). • GENERATE $0, flatten($1) – (a, b, c). • For bags, flatten substitutes bags with new tuples. • For Example, consider a bag ({(b,c),(d,e)}). • GENERATE flatten($0), – will end up with two tuples (b,c) and (d,e). • When we remove a level of nesting in a bag, sometimes we cause a cross product to happen. • For example, consider a relation (a, {(b,c), (d,e)}) • GENERATE $0, flatten($1), – it will create new tuples: (a, b, c) and (a, d, e).
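The three FLATTEN cases described above can be modelled in plain Python (not Pig code), with Python tuples standing in for Pig tuples and lists standing in for bags. The function name `flatten_generate` is hypothetical.

```python
from itertools import product

def flatten_generate(row, flatten_positions):
    """Model of GENERATE with FLATTEN applied to the fields at flatten_positions.

    A flattened tuple is spliced into the output row. A flattened bag (a list
    here) yields one output row per inner tuple, so flattening a bag next to
    other fields behaves like a cross product.
    """
    choices = []
    for index, field in enumerate(row):
        if index in flatten_positions and isinstance(field, list):     # bag
            choices.append([tuple(inner) for inner in field])
        elif index in flatten_positions and isinstance(field, tuple):  # tuple
            choices.append([field])
        else:                                                          # kept as one field
            choices.append([(field,)])
    return [sum(combo, ()) for combo in product(*choices)]

print(flatten_generate(("a", ("b", "c")), {1}))                # tuple case
print(flatten_generate(([("b", "c"), ("d", "e")],), {0}))      # bag case
print(flatten_generate(("a", [("b", "c"), ("d", "e")]), {1}))  # cross-product case
```

In this model an empty bag produces no output rows at all, which is the property the null-handling discussion for joins relies on.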
  • 35. JOIN Definition: Performs a join of two or more relations based on common field values.
    Syntax: X = JOIN A BY $0, B BY $0;
    which is equivalent to:
    X = COGROUP A BY $0 INNER, B BY $0 INNER;
    Y = FOREACH X GENERATE FLATTEN(A), FLATTEN(B);
    The result is:
    (1, 2, 3, 1, 3) (4, 2, 1, 4, 6) (4, 3, 3, 4, 6) (4, 2, 1, 4, 9) (4, 3, 3, 4, 9) (8, 3, 4, 8, 9) (8, 4, 3, 8, 9)
    The intermediate COGROUP result X is:
    (1, {(1, 2, 3)}, {(1, 3)}) (4, {(4, 2, 1), (4, 3, 3)}, {(4, 6), (4, 9)}) (8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})
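The equivalence between JOIN and an inner COGROUP followed by FLATTEN can be checked with a plain-Python model (not Pig code). The function name `inner_join` is an assumption, and the output order is not meaningful because Pig relations are unordered.

```python
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]  # FileA.txt
B = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]            # FileB.txt

def inner_join(left, right):
    """Model of JOIN left BY $0, right BY $0 as inner cogroup plus flatten."""
    joined = []
    shared_keys = sorted({row[0] for row in left} & {row[0] for row in right})
    for key in shared_keys:                      # COGROUP ... INNER
        left_bag = [row for row in left if row[0] == key]
        right_bag = [row for row in right if row[0] == key]
        for left_row in left_bag:                # FLATTEN(A), FLATTEN(B):
            for right_row in right_bag:          # cross product within the group
                joined.append(left_row + right_row)
    return joined

X = inner_join(A, B)
for row in X:
    print(row)
```

Keys 2 and 7 appear in only one input, so they contribute nothing to the join.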
  • 36. Distinct Removes duplicate tuples in a relation. X = FOREACH A GENERATE $2; (3) (1) (4) (3) (5) (3) Y = DISTINCT X; (1) (3) (4) (5)
  • 37. CROSS •Computes the cross product of two or more relations. Example: X = CROSS A, B; (1, 2, 3, 2, 4) (1, 2, 3, 8, 9) (1, 2, 3, 1, 3) (1, 2, 3, 2, 7) (1, 2, 3, 2, 9) (1, 2, 3, 4, 6) (1, 2, 3, 4, 9) (4, 2, 1, 2, 4) (4, 2, 1, 8, 9)
  • 38. SPLIT Partitions a relation into two or more relations. Example: A = LOAD 'data' AS (f1:int,f2:int,f3:int); DUMP A; (1,2,3) (4,5,6) (7,8,9) SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6); DUMP X; (1,2,3) (4,5,6) DUMP Y; (4,5,6) DUMP Z; (1,2,3) (7,8,9)
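A plain-Python model of SPLIT (not Pig code) makes one detail of the example explicit: a tuple is tested against every condition, so it can land in several outputs or in none. The function name `split` is hypothetical.

```python
A = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

def split(relation, conditions):
    """Model of SPLIT: send each tuple to every output whose condition holds."""
    outputs = {name: [] for name in conditions}
    for row in relation:
        for name, predicate in conditions.items():
            if predicate(row):
                outputs[name].append(row)
    return outputs

parts = split(A, {
    "X": lambda r: r[0] < 7,               # X IF f1 < 7
    "Y": lambda r: r[1] == 5,              # Y IF f2 == 5
    "Z": lambda r: r[2] < 6 or r[2] > 6,   # Z IF (f3 < 6 OR f3 > 6)
})
print(parts)
```

Here (4,5,6) ends up in both X and Y, while (7,8,9) ends up only in Z.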
  • 39. Some more commands • To select a few columns from a dataset – S1 = FOREACH A GENERATE a1, a2; • Simple calculation on a dataset – K = FOREACH A GENERATE $1, $2, $1*$2; • To display only 100 records – B = LIMIT A 100; • To see the structure/schema – DESCRIBE A; • To union two datasets – C = UNION A, B;
  • 40. Word Count Program Create a basic wordsample.txt file and move it to HDFS.
    x = LOAD '/home/pgupta5/prashant/data.txt';
    y = FOREACH x GENERATE FLATTEN(TOKENIZE((chararray) $0)) AS word;
    z = GROUP y BY word;
    counter = FOREACH z GENERATE group, COUNT(y);
    STORE counter INTO '/NewPigData/WordCount';
  • 42. Another Example i/p: webcount en 70 2012 en 60 2013 us 80 2012 en 40 2014 us 80 2012
    records = LOAD 'webcount' USING PigStorage('\t') AS (country:chararray, name:chararray, pagecount:int, year:int);
    filtered_records = FILTER records BY country == 'en';
    grouped_records = GROUP filtered_records BY name;
    results = FOREACH grouped_records GENERATE group, SUM(filtered_records.pagecount);
    sorted_result = ORDER results BY $1 DESC;
    STORE sorted_result INTO '/some_external_HDFS_location//data'; -- Hive external table path
  • 43. Find Maximum Score i/p: CricketScore.txt
    a = LOAD '/user/cloudera/SampleDataFile/CricketScore.txt' USING PigStorage('\t');
    b = FOREACH a GENERATE $0, $1;
    c = GROUP b BY $0;
    d = FOREACH c GENERATE group, MAX(b.$1);
    DUMP d;
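The word count pipeline above follows a tokenize, flatten, group, count pattern, which can be sketched in plain Python (not Pig code). The sample lines are made up, and this sketch splits on whitespace only, whereas Pig's TOKENIZE also treats some punctuation characters as delimiters.

```python
from collections import Counter

# Made-up sample input, one string per input line.
lines = ["the quick brown fox", "the lazy dog", "the fox"]

def word_count(text_lines):
    """Tokenize each line, flatten the tokens into one stream, and count per word."""
    words = [word for line in text_lines for word in line.split()]  # FLATTEN(TOKENIZE($0))
    return Counter(words)                                           # GROUP BY word + COUNT

counter = word_count(lines)
print(counter.most_common())
```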
  • 44. Sorting Data Relations are unordered in Pig. Consider a relation A: • grunt> DUMP A; • (2,3) • (1,2) • (2,4) There is no guarantee which order the rows will be processed in. In particular, when retrieving the contents of A using DUMP or STORE, the rows may be written in any order. If you want to impose an order on the output, you can use the ORDER operator to sort a relation by one or more fields. The following example sorts A by the first field in ascending order and by the second field in descending order: grunt> B = ORDER A BY $0, $1 DESC; grunt> DUMP B; • (1,2) • (2,4) • (2,3) Any further processing on a sorted relation is not guaranteed to retain its order.
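The mixed-direction sort in the ORDER example can be mirrored in plain Python (not Pig code). Negating the second field is a shortcut that only works for numeric values; it is an assumption of this sketch rather than anything Pig does.

```python
A = [(2, 3), (1, 2), (2, 4)]

# B = ORDER A BY $0, $1 DESC;
# Ascending on the first field, descending on the second (numeric fields only).
B = sorted(A, key=lambda row: (row[0], -row[1]))
print(B)
```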
  • 45. Using Hive tables with HCatalog • HCatalog (which is a component of Hive) provides access to Hive's metastore, so that Pig queries can reference schemas by name rather than specifying them in full each time. • For example, after loading data into a Hive table (student_tbl in the School_db database in this example), Pig can access the table's schema and data as follows: • pig -useHCatalog • grunt> records = LOAD 'School_db.student_tbl' USING org.apache.hcatalog.pig.HCatLoader(); • grunt> DESCRIBE records; • grunt> DUMP records;
  • 46. PIG UDFs Pig provides extensive support for user defined functions (UDFs) to specify custom processing. REGISTER registers the JAR file with the Pig runtime.
    REGISTER myudfs.jar; -- the JAR file should be available on the local Linux file system
    A = LOAD 'student_data' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
    B = FOREACH A GENERATE myudfs.UPPER(name);
    DUMP B;
  • 47. UDF Sample Program
    package myudfs;
    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    public class UPPER extends EvalFunc<String> {
        public String exec(Tuple input) throws IOException {
            // No input tuple or no fields: nothing to convert.
            if (input == null || input.size() == 0) return null;
            try {
                String str = (String) input.get(0);
                return str.toUpperCase();
            } catch (Exception e) {
                throw new IOException("Caught exception processing input row ", e);
            }
        }
    }
    • (A Java UDF for Pig extends EvalFunc and implements its exec method.)
  • 48. Diagnostic operator • DESCRIBE: Prints a relation’s schema. • EXPLAIN: Prints the logical and physical plans. • ILLUSTRATE: Shows a sample execution of the logical plan, using a generated subset of the input.
  • 49. Performance Tuning Pig does not (yet) determine when a field is no longer needed and drop the field from the row. For example, say you have a query like: • Project Early and Often – A = load 'myfile' as (t, u, v); – B = load 'myotherfile' as (x, y, z); – C = join A by t, B by x; – D = group C by u; – E = foreach D generate group, COUNT($1); • There is no need for v, y, or z to participate in this query. And there is no need to carry both t and x past the join, just one will suffice. Changing the query above to the query below will greatly reduce the amount of data being carried through the map and reduce phases by pig. – A = load 'myfile' as (t, u, v); – A1 = foreach A generate t, u; – B = load 'myotherfile' as (x, y, z); – B1 = foreach B generate x; – C = join A1 by t, B1 by x; – C1 = foreach C generate t, u; – D = group C1 by u; – E = foreach D generate group, COUNT($1);
  • 50. Performance Tuning As with early projection, in most cases it is beneficial to apply filters as early as possible to reduce the amount of data flowing through the pipeline. • Filter Early and Often -- Query 1 – A = load 'myfile' as (t, u, v); – B = load 'myotherfile' as (x, y, z); – C = filter A by t == 1; – D = join C by t, B by x; – E = group D by u; – F = foreach E generate group, COUNT($1); -- Query 2 – A = load 'myfile' as (t, u, v); – B = load 'myotherfile' as (x, y, z); – C = join A by t, B by x; – D = group C by u; – E = foreach D generate group, COUNT($1); – F = filter E by C.t == 1; • The first query is clearly more efficient than the second one because it reduces the amount of data going into the join.
  • 51. Performance Tuning Often you are not interested in the entire output but rather a sample or top results. In such cases, LIMIT can yield a much better performance as we push the limit as high as possible to minimize the amount of data travelling through the pipeline. • Use the LIMIT Operator – A = load 'myfile' as (t, u, v); – B = order A by t; – C = limit B 500;
  • 52. Performance Tuning If types are not specified in the load statement, Pig assumes the type double for numeric computations. Much of the time your data fits a smaller type, such as int or long. Specifying the real type will help with the speed of arithmetic computation. • Use Types – --Query 1 • A = load 'myfile' as (t, u, v); • B = foreach A generate t + u; – --Query 2 • A = load 'myfile' as (t: int, u: int, v); • B = foreach A generate t + u; • The second query will run more efficiently than the first. In some of our queries we see a 2x speedup.
  • 53. Performance Tuning • Use Joins appropriately. – Understand Skewed Vs. Replicated vs. Merge join. • Remove null values before join. – A = load 'myfile' as (t, u, v); – B = load 'myotherfile' as (x, y, z); – C = join A by t, B by x; • is rewritten by Pig to – A = load 'myfile' as (t, u, v); – B = load 'myotherfile' as (x, y, z); – C1 = cogroup A by t INNER, B by x INNER; – C = foreach C1 generate flatten(A), flatten(B); Since the nulls from A and B won't be collected together, when the nulls are flattened we're guaranteed to have an empty bag, which will result in no output. But they will not be dropped until the last possible moment.
  • 54. Performance Tuning • Hence the previous query should be rewritten as – A = load 'myfile' as (t, u, v); – B = load 'myotherfile' as (x, y, z); – A1 = filter A by t is not null; – B1 = filter B by x is not null; – C = join A1 by t, B1 by x; Now nulls will be dropped before the join. Since all null keys go to a single reducer, if your key is null even a small percentage of the time the gain can be significant.
  • 55. Performance Tuning • You can set the number of reduce tasks for the MapReduce jobs generated by Pig using the parallel reducer feature. – The SET default_parallel command is used at the script level. • In this example, all the MapReduce jobs that get launched use 20 reducers. – SET default_parallel 20; – A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v); – B = GROUP A BY t; – C = FOREACH B GENERATE group, COUNT(A.t) as mycount; – D = ORDER C BY mycount; – The PARALLEL clause can be used with any operator that starts a reduce phase, such as group, cogroup, join, order by and distinct.
  • 56. Replicated Join • One of the datasets is small enough that it fits in memory. • A replicated join copies the small dataset to the distributed cache (space that is available on every cluster machine) and loads it into memory. • Because the data is available in memory (via the distributed cache) and is processed on the map side of MapReduce, this operation works much faster than a default join.
  • 57. • Limitations It isn’t clear how small the dataset needs to be for using replicated join. According to the Pig documentation, a relation of up to 100 MB can be used when the process has 1 GB of memory. A run-time error will be generated if not enough memory is available for loading the data.
  • 58. • transactions = load 'customer_transactions' as ( fname, lname, city, state, country, amount, tax); • geography = load 'geo_data' as (state, country, district, manager); Regular join • sales = join transactions by (state, country), geography by (state, country); Replicated join • sales = join transactions by (state, country), geography by (state, country) using 'replicated';
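Conceptually, a replicated join works like an in-memory hash join: the small relation is loaded into a lookup table and the large relation is streamed past it. The plain-Python sketch below (not Pig code) models that idea; the sample rows use shortened, made-up field layouts and the function name `replicated_join` is hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows with shortened field layouts, for illustration only.
transactions = [("Ann", "CA", "US", 120.0), ("Bob", "TX", "US", 80.0), ("Cid", "ON", "CA", 50.0)]
geography = [("CA", "US", "West"), ("TX", "US", "South")]

def replicated_join(large, small, large_key, small_key):
    """Hash-join model: index the small relation in memory, then stream the large one."""
    lookup = defaultdict(list)
    for row in small:                      # the small relation must fit in memory
        lookup[small_key(row)].append(row)
    joined = []
    for row in large:                      # one pass over the large relation, no shuffle
        for match in lookup.get(large_key(row), []):
            joined.append(row + match)
    return joined

sales = replicated_join(
    transactions, geography,
    large_key=lambda t: (t[1], t[2]),      # (state, country) in transactions
    small_key=lambda g: (g[0], g[1]),      # (state, country) in geography
)
print(sales)
```

In the Pig statement, the large relation is listed first and the relation to be held in memory comes after it.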
  • 59. Skewed Join • One of the keys is much more common than others, and the data for it is too large to fit in the memory. • Standard joins run in parallel across different reducers by splitting key values across processes. If there is a lot of data for a certain key, the data will not be distributed evenly across the reducers, and one of them will be ‘stuck’ processing the majority of data. • Skewed join handles this case. It calculates a histogram to check which key is the most prevalent and then splits its data across different reducers for optimal performance.
  • 60. • transactions = load 'customer_transactions' as ( fname, lname, city, state, country, amount, tax); • geography = load 'geo_data' as (state, country, district, manager); • sales = join transactions by (state, country), geography by (state, country) using 'skewed';
  • 61. Merge Join • The two datasets are both sorted in ascending order by the join key. • Datasets may already be sorted by the join key if that's the order in which data was entered or they have undergone sorting before the join operation for other needs. • When merge join receives the pre-sorted datasets, they are read and compared on the map side, and as a result they run faster. Both inner and outer join are available.
  • 62. • transactions = load 'customer_transactions' as ( fname, lname, city, state, country, amount, tax); • geography = load 'geo_data' as (state, country, district, manager); • sales = join transactions by (state, country), geography by (state, country) using 'merge';
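The reason pre-sorted inputs help can be seen in a plain-Python sketch of a sort-merge inner join (not Pig code, and not a description of Pig's internal implementation): two cursors advance through the sorted inputs, so neither side has to be held in memory or shuffled. The function name `merge_join` and the sample rows are made up.

```python
def merge_join(left, right, key=lambda row: row[0]):
    """Inner join of two lists that are already sorted ascending by key."""
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = key(left[i]), key(right[j])
        if left_key < right_key:
            i += 1                          # left is behind: advance it
        elif left_key > right_key:
            j += 1                          # right is behind: advance it
        else:
            # Find the run of equal keys on the right side.
            run_end = j
            while run_end < len(right) and key(right[run_end]) == left_key:
                run_end += 1
            # Pair every left row in its run with every right row in the run.
            while i < len(left) and key(left[i]) == left_key:
                joined.extend(left[i] + r for r in right[j:run_end])
                i += 1
            j = run_end
    return joined

left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
print(merge_join(left, right))
```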

Editor's Notes

  1. Pig is made up of two components: the first is the language itself, which is called PigLatin (yes, people naming various Hadoop projects do tend to have a sense of humor associated with their naming conventions), and the second is a runtime environment where PigLatin programs are executed. Think of the relationship between a Java Virtual Machine (JVM) and a Java application.
  2. As the example is written, this job requires both a Map and a Reduce phase to make the join work, which leads to growing inefficiency as the customer data set grows in size. This is the exact scenario that is optimized by using a replicated join. The replicated join tells Pig to distribute the geography set to each node, where it can be joined directly in the Map phase, eliminating the need for the Reduce phase altogether.
  3. Skewed join supports both inner and outer join, though only with two inputs; joins between additional tables should be broken up into further joins. Also, there is a pig.skewedjoin.reduce.memusage Java parameter that specifies the heap fraction available to reducers in order to perform this join. Setting a low value means more reducers will be used, yet the cost of copying the data across them will increase. Pig's developers claim to have good performance when setting it between 0.1 and 0.4.