Pig - A data flow language and execution environment
for exploring very large datasets.
Pig runs on HDFS and MapReduce clusters.
• Pig is made up of two pieces:
• The language used to express data flows, called
Pig Latin.
• The execution environment to run Pig Latin
programs. There are currently two environments:
local execution in a single JVM and distributed
execution on a Hadoop cluster.
• Pig Latin: aims to combine the best of SQL and MapReduce, pairing
high-level declarative querying with low-level procedural programming.
Pig Latin Example
Table urls: (url,category, pagerank)
Find, for each sufficiently large category, the average pagerank of
high-pagerank urls in that category
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6
Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls)>10^6;
output = FOREACH big_groups GENERATE category,
AVG(good_urls.pagerank);
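A rough Python sketch of the same dataflow, one step per Pig Latin statement. The sample rows and the lowered group-size threshold (1 instead of 10^6) are made up for illustration; this models the semantics only, not how Pig executes the script.

```python
from collections import defaultdict

# Hypothetical sample of the urls table: (url, category, pagerank)
urls = [
    ("a.com", "news", 1.0),
    ("b.com", "news", 0.5),
    ("c.com", "news", 0.125),
    ("d.com", "blogs", 0.75),
]
MIN_GROUP_SIZE = 1  # stand-in for the 10^6 threshold on the slide

# good_urls = FILTER urls BY pagerank > 0.2;
good_urls = [u for u in urls if u[2] > 0.2]

# groups = GROUP good_urls BY category;
groups = defaultdict(list)
for u in good_urls:
    groups[u[1]].append(u)

# big_groups = FILTER groups BY COUNT(good_urls) > threshold;
big_groups = {cat: rows for cat, rows in groups.items() if len(rows) > MIN_GROUP_SIZE}

# final step: category plus the average pagerank of its good urls
output = [(cat, sum(r[2] for r in rows) / len(rows)) for cat, rows in big_groups.items()]
print(output)
```

Each intermediate name can be inspected on its own, which is the step-by-step quality the Pig Latin version has and the single SQL statement lacks.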
Differences between Pig Latin and RDBMS
Pig Latin
•Pig Latin is a data flow programming language
•Pig Latin program is a step-by-step set of
operations on an input relation, in which
each step is a single transformation.
•It will operate on any source of tuples (the
source should support being read in parallel,
by being in multiple files, for example)
where a UDF is used to read the tuples from
their raw representation. The most common
representation is a text file with tab-
separated fields, and Pig provides a built-in
load function for this format. We can define a
schema at runtime, but it’s optional.
• There is no data import process to load the
data. The data is loaded from the filesystem
(usually HDFS) as the first step in the processing.
RDBMS (SQL)
•SQL is a declarative programming language
•SQL statements are a set of constraints
that, taken together, define the output.
•RDBMSs store data in tables, with tightly
predefined schemas.
• There is a data import process to load the
data into the RDBMS.
Differences between Pig Latin and RDBMS
Pig Latin
•Pig supports complex, nested data structures.
•Pig does not support random reads or queries
on the order of tens of milliseconds. Nor does it
support random writes to update small portions
of data; all writes are bulk, streaming
writes, just like MapReduce.
• Pig’s ability to use UDFs and streaming
operators that are tightly integrated with the
language and Pig’s nested data structures
makes Pig Latin more customizable than most
SQL dialects.
•Pig does not have features to support online,
low-latency queries such as transactions and
indexes.
RDBMS
•Operates on flatter data structures.
• Supports random reads and queries.
• Not very customizable.
•RDBMSs have features to support
online, low-latency queries such as
transactions and indexes.
 Pig Latin (The Language)
 Data Structures
 Commands
 Pig (The Compiler)
 Logical & Physical Plans
 Optimization
 Efficiency
 Grunt (The Interpreter)
 Pig Pen (The Debugger)
Big Picture
[Figure: a Pig Latin program between "Read Data" and "Write Results"]
Data Model
 Atom - simple atomic value (e.g., a number or string)
 Tuple - sequence of fields; each field can be of any type
 Bag - collection of tuples
   Duplicates possible
   Tuples in a bag can have different field lengths and field types
 Map - collection of key-value pairs
   Key is an atom; value can be any type
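One way to picture the four types is with plain Python values. This is only an analogy under stated assumptions: a Python list is ordered, whereas a bag is an unordered collection, and all the sample values are made up.

```python
# Atom: a simple value such as a number or a string
atom = "alice"

# Tuple: a sequence of fields; each field may be of any type
t = ("alice", 30, {"city": "waterloo"})

# Bag: a collection of tuples; duplicates are allowed and tuples may
# differ in length and field types (a list stands in for the bag here)
bag = [("alice", 30), ("alice", 30), ("bob",)]

# Map: keys are atoms, values may be of any type (here a bag and an atom)
m = {"friends": bag, "age": 30}

# Nesting: a tuple can hold a bag, which in turn holds tuples
nested = ("waterloo", bag)
print(len(nested[1]))
```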
Data Model
 Control over dataflow
Ex 1 (less efficient)
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
Ex 2 (more efficient)
highpgr_urls = FILTER urls BY pagerank > 0.8;
spam_urls = FILTER highpgr_urls BY isSpam(url);
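Why the second ordering is cheaper can be sketched in Python by counting calls to the expensive test. The is_spam function below is a hypothetical stand-in for the isSpam UDF, and the sample rows are made up.

```python
counter = {"calls": 0}

def is_spam(url):
    """Hypothetical stand-in for an expensive isSpam UDF; counts its invocations."""
    counter["calls"] += 1
    return "spam" in url

# (url, pagerank)
urls = [("spam1.com", 0.9), ("ok.com", 0.95), ("spam2.com", 0.1), ("fine.com", 0.2)]

# Ex 1: run the UDF on every row, then apply the cheap pagerank test
counter["calls"] = 0
ex1 = [u for u in urls if is_spam(u[0])]
ex1 = [u for u in ex1 if u[1] > 0.8]
calls_ex1 = counter["calls"]

# Ex 2: cheap test first, so the UDF only sees the surviving rows
counter["calls"] = 0
ex2 = [u for u in urls if u[1] > 0.8]
ex2 = [u for u in ex2 if is_spam(u[0])]
calls_ex2 = counter["calls"]

print(ex1 == ex2, calls_ex1, calls_ex2)
```

Both orderings return the same rows; only the number of UDF calls differs.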
 Fully nested
 More natural for procedural programmers (target user) than normalization
 Data is often stored on disk in a nested fashion
 Facilitates ease of writing user-defined functions
 No schema required
Data Model
 User-Defined Functions (UDFs)
 Can be used in many Pig Latin statements
 Useful for custom processing tasks
 Can use non-atomic values for input and output
 Currently must be written in Java
 Ex: spam_urls = FILTER urls BY isSpam(url);
 Pig system does two tasks:
 Builds a Logical Plan from a Pig Latin script
 Supports execution platform independence
 No processing of data performed at this stage
 Compiles the Logical Plan to a Physical Plan and Executes
 Convert the Logical Plan into a series of Map-Reduce statements to
be executed (in this case) by Hadoop Map-Reduce
 Building a Logical Plan
 Verify input files and bags referred to are valid
 Create a logical plan for each bag(variable) defined
 Building a Logical Plan Example
A = LOAD 'user.dat' AS (name, age, city);
B = GROUP A BY city;
-- assumed: the statement defining C is missing from the slide text; a per-city count fits the output file name
C = FOREACH B GENERATE group AS city, COUNT(A);
D = FILTER C BY city == 'kitchener'
OR city == 'waterloo';
STORE D INTO 'local_user_count.dat';
[Figure: the logical plan is built one operator at a time: Load(user.dat), Group, Filter]
 Building a Physical Plan
 Only happens when output is specified by STORE or DUMP
 Building a Physical Plan
 Step 1: Create a map-reduce job for each (CO)GROUP
 Step 2: Push other commands into the map and reduce functions where possible
 Certain commands may require their own map-reduce job (e.g., ORDER needs separate map-reduce jobs)
Syntax and Semantics of Pig Latin
• Structure
• Pig Latin program consists of a collection of statements. A statement can be
thought of as an operation, or a command.
(e.g.) grouped_records = GROUP records BY year;
ls /
• Statements are usually terminated with a semicolon
• Pig Latin has two forms of comments. Double hyphens introduce single-line comments:
-- My program
DUMP A; -- What's in A?
• C-style comments are more flexible since they delimit the beginning and end
of the comment block with /* and */ markers. They can span lines or be embedded
in a single line:
/*
 * Description of my program spanning
 * multiple lines.
 */
A = LOAD 'input/pig/join/A';
B = LOAD 'input/pig/join/B';
C = JOIN A BY $0, /* ignored */ B BY $1;
Syntax and Semantics of Pig Latin
Pig Latin has a list of keywords that have a special meaning in
the language and cannot be used as identifiers.
(e.g.) operators (LOAD, ILLUSTRATE)
commands (cat, ls)
expressions (matches, FLATTEN)
and functions (DIFF, MAX)
Pig Latin has mixed rules on case sensitivity.
Operators and commands are not case-sensitive (to
make interactive use more forgiving).
Aliases and function names are case-sensitive.
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality
== 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;
Syntax and Semantics of Pig Latin
•When the Pig Latin interpreter sees the first line containing the LOAD
statement, it confirms that it is syntactically and semantically correct, and adds
it to the logical plan, but it does not load the data from the file (or even check
whether the file exists).
•Pig validates the GROUP and FOREACH...GENERATE statements, and adds
them to the logical plan without executing them.
• The trigger for Pig to start execution is the DUMP statement. At that point,
the logical plan is compiled into a physical plan and executed.
• The physical plan that Pig prepares is a series of MapReduce jobs, which in
local mode Pig runs in the local JVM, and in MapReduce mode Pig runs on a
Hadoop cluster.
• We can see the logical and physical plans created by Pig using the EXPLAIN
command on a relation (EXPLAIN max_temp; for example).
• EXPLAIN will also show the MapReduce plan, which shows how the physical
operators are grouped into MapReduce jobs.
• This is a good way to find out how many MapReduce jobs Pig will run for your query.
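The deferred execution described above can be modelled with a toy Python class: statements only extend a plan, and nothing runs until a dump is requested. The Script class and its methods are hypothetical names for this sketch and have nothing to do with Pig's internals.

```python
class Script:
    """Toy model: statements extend a plan; nothing runs until dump()."""

    def __init__(self):
        self.plan = []      # the "logical plan": ordered (name, function) pairs
        self.executed = []  # names of the steps that have actually run

    def add(self, name, fn):
        # Record the step only; no data is touched here
        self.plan.append((name, fn))
        return self

    def dump(self, data):
        # Execution is triggered here, step by step
        for name, fn in self.plan:
            data = fn(data)
            self.executed.append(name)
        return data


s = Script()
s.add("FILTER", lambda rows: [r for r in rows if r[1] != 9999])
s.add("MAX", lambda rows: max(r[1] for r in rows))
ran_before_dump = list(s.executed)
result = s.dump([(1950, 22), (1950, 9999), (1949, 111)])
print(ran_before_dump, result)
```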
Multiquery execution
Since DUMP is a diagnostic tool, it will always trigger execution. However, the STORE
command is different. In batch mode, Pig will parse the whole script to see if there are
any optimizations that could be made to limit the amount of data to be written to or
read from disk. Consider the following simple example:
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';
Relations B and C are both derived from A, so to save reading A twice, Pig can
run this script as a single MapReduce job by reading A once and writing two
output files from the job, one for each of B and C. This feature is called
multiquery execution.
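The idea behind multiquery execution, sketched in Python: scan A once and route each row to the outputs whose condition it satisfies. The read_a helper and the sample rows are made up, and the sketch ignores Pig's null handling.

```python
def read_a():
    """Hypothetical reader for relation A; counts how often the input is scanned."""
    read_a.scans += 1
    yield from [("Joe", "cherry"), ("Ali", "banana"), ("Eve", "banana")]

read_a.scans = 0

out_b, out_c = [], []
for row in read_a():            # a single pass over A
    if row[1] == "banana":      # B = FILTER A BY $1 == 'banana';
        out_b.append(row)
    if row[1] != "banana":      # C = FILTER A BY $1 != 'banana';
        out_c.append(row)

print(read_a.scans, out_b, out_c)
```

Running the two filters as separate jobs would call the reader twice; here both outputs come from one scan.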
Pig Latin relational operators
Pig Latin diagnostic operators
Pig Latin macro and UDF statements
Pig commands to interact with Hadoop
Pig Latin expressions
Pig Latin types
Pig provides built-in functions TOTUPLE, TOBAG and TOMAP, which are used for
turning expressions into tuples, bags and maps.
A relation in Pig may have an associated schema, which gives the fields in the relation
names and types. The AS clause in a LOAD statement attaches a schema to a relation:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: int,temperature: int,quality: int}
It’s possible to omit type declarations completely, too:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year, temperature, quality);
grunt> DESCRIBE records;
records: {year: bytearray,temperature: bytearray,quality: bytearray}
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: bytearray,temperature: int,quality: int}
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt';
grunt> DESCRIBE records;
Schema for records unknown.
Fields in a relation with no schema can be referenced only using positional notation:
$0 refers to the first field in a relation, $1 to the second, and so on. Their types default
to bytearray:
grunt> projected_records = FOREACH records GENERATE $0, $1, $2;
grunt> DUMP projected_records;
grunt> DESCRIBE projected_records;
projected_records: {bytearray,bytearray,bytearray}
Validation and nulls
An SQL database will enforce the constraints in a table’s schema at load time: for
example, trying to load a string into a column that is declared to be a numeric type
will fail. In Pig, if the value cannot be cast to the type declared in the schema, then it
will substitute a null value. Let’s see how this works if we have the following input for
the weather data, which has an “e” character in place of an integer:
1950 0 1
1950 22 1
1950 e 1
1949 111 1
1949 78 1
Pig handles the corrupt line by producing a null for the offending value, which is displayed
as the absence of a value when dumped to screen (and also when saved using STORE):
grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>> AS (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
Pig produces a warning for the invalid field (not shown here), but does not halt its
processing. For large datasets, it is very common to have corrupt, invalid, or merely
unexpected data, and it is generally infeasible to incrementally fix every unparsable
record. Instead, we can pull out all of the invalid records in one go, so we can take
action on them, perhaps by fixing our program (because they indicate we have made
a mistake) or by filtering them out (because the data is genuinely unusable):
grunt> corrupt_records = FILTER records BY temperature is null;
grunt> DUMP corrupt_records;
Note the use of the is null operator, which is analogous to SQL.
We can find the number of corrupt records using the following idiom for counting the
number of rows in a relation:
grunt> grouped = GROUP corrupt_records ALL;
grunt> all_grouped = FOREACH grouped GENERATE group,
>> COUNT(corrupt_records);
grunt> DUMP all_grouped;
Validation and nulls
Another useful technique is to use the SPLIT operator to partition the data into “good”
and “bad” relations, which can then be analyzed separately:
grunt> SPLIT records INTO good_records IF temperature is not null,
>> bad_records IF temperature is null;
grunt> DUMP good_records;
grunt> DUMP bad_records;
Validation and nulls
If temperature’s type was left undeclared, the corrupt data cannot be
easily detected, since it doesn’t surface as a null:
grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>> AS (year:chararray, temperature, quality:int);
grunt> DUMP records;
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR
>> quality == 9);
grunt> grouped_records = GROUP filtered_records BY year;
grunt> max_temp = FOREACH grouped_records GENERATE group,
>> MAX(filtered_records.temperature);
grunt> DUMP max_temp;
Validation and nulls
What happens in this case is that the temperature field is interpreted as a
bytearray, so the corrupt field is not detected when the input is loaded. When
passed to the MAX function, the temperature field is cast to a double, since
MAX works only with numeric types. The corrupt field can not be represented as
a double, so it becomes a null, which MAX silently ignores. The best approach
is generally to declare types for our data on loading, and look for missing or
corrupt values in the relations themselves before we do our main processing.
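The cast-to-null behaviour and the way MAX skips nulls can be imitated in Python, with None standing in for Pig's null. The to_int and pig_max helpers are illustrative names, not Pig functions; the rows are the sample input shown earlier.

```python
def to_int(raw):
    """Cast a raw field to int; a value that cannot be cast becomes None (null)."""
    try:
        return int(raw)
    except ValueError:
        return None

def pig_max(values):
    """Maximum that skips nulls; returns None if every value is null."""
    present = [v for v in values if v is not None]
    return max(present) if present else None

raw_rows = [("1950", "0", "1"), ("1950", "22", "1"), ("1950", "e", "1"),
            ("1949", "111", "1"), ("1949", "78", "1")]

# Declaring temperature:int on load corresponds to casting every field up front
records = [(year, to_int(temp), to_int(q)) for year, temp, q in raw_rows]

# corrupt_records = FILTER records BY temperature is null;
corrupt_records = [r for r in records if r[1] is None]

max_1950 = pig_max(r[1] for r in records if r[0] == "1950")
print(corrupt_records, max_1950)
```

With the type declared, the bad row is visible as a null before any aggregation; without it, the same row would only disappear silently inside the maximum.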
Sometimes corrupt data shows up as smaller tuples since fields are simply missing. You
can filter these out by using the SIZE function as follows:
grunt> A = LOAD 'input/pig/corrupt/missing_fields';
grunt> DUMP A;
grunt> B = FILTER A BY SIZE(TOTUPLE(*)) > 1;
grunt> DUMP B;
Eval function
A function that takes one or more expressions and returns another expression.
An example of a built-in eval function is MAX, which returns the maximum value of the entries in
a bag. Some eval functions are aggregate functions, which means they operate on a
bag of data to produce a scalar value; MAX is an example of an aggregate function.
Furthermore, many aggregate functions are algebraic, which means that the result of the
function may be calculated incrementally. In MapReduce terms, algebraic functions
make use of the combiner and are much more efficient to calculate. MAX is an algebraic
function, whereas a function to calculate the median of a collection of values is an
example of a function that is not algebraic.
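A small Python illustration of the algebraic property, assuming the bag has been split into three made-up partitions: combining partial maxima gives the right answer, while combining partial medians does not.

```python
import statistics

# Hypothetical map-side partitions of one bag of temperatures
partitions = [[0, 22, -11], [111, 78], [5]]

# MAX is algebraic: take per-partition maxima (the combiner's job),
# then take the maximum of those partial results
partial_maxima = [max(p) for p in partitions]
overall_max = max(partial_maxima)

# Median is not algebraic: the median of per-partition medians
# generally differs from the median of all the values
median_of_medians = statistics.median(statistics.median(p) for p in partitions)
true_median = statistics.median(v for p in partitions for v in p)

print(overall_max, median_of_medians, true_median)
```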
Filter function
A special type of eval function that returns a logical boolean result. Filter functions are
used in the FILTER operator to remove unwanted rows. They can also be used in other
relational operators that take boolean conditions and, in general, expressions using
boolean or conditional expressions. An example of a built-in filter function is IsEmpty,
which tests whether a bag or a map contains any items.
Load function
A function that specifies how to load data into a relation from external storage.
Store function
A function that specifies how to save the contents of a relation to external storage. Load
and store functions are implemented by the same type. For example, PigStorage, which
loads data from delimited text files, can store data in the same format.
Pig’s built-in functions
Macros provide a way to package reusable pieces of Pig Latin code from within Pig
Latin itself. For example, we can extract the part of our Pig Latin program that
performs grouping on a relation then finds the maximum value in each group, by
defining a macro as follows:
DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
A = GROUP $X by $group_key;
$Y = FOREACH A GENERATE group, MAX($X.$max_field);
};
The macro, called max_by_group, takes three parameters: a relation, X, and two
field names, group_key and max_field. It returns a single relation, Y. Within the
macro body, parameters and return aliases are referenced with a $ prefix, such as
$X.
The macro is used as follows:
records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray,
temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
max_temp = max_by_group(filtered_records, year, temperature);
DUMP max_temp
At runtime, Pig will expand the macro using the macro definition. After
expansion, the program looks like the following, with the GROUP and FOREACH
statements forming the expanded section:
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
macro_max_by_group_A_0 = GROUP filtered_records by (year);
max_temp = FOREACH macro_max_by_group_A_0 GENERATE group,
MAX(filtered_records.(temperature));
DUMP max_temp
To help reuse, macros can be defined in files separate from Pig scripts, in which
case they need to be imported into any script that uses them. An import
statement looks like this:
IMPORT './ch11/src/main/pig/max_temp.macro';
User-Defined Functions
(Filter UDF, Eval UDF, Load UDF)
An Eval UDF
Dynamic Invokers
grunt> DEFINE trim
InvokeForString('org.apache.commons.lang.StringUtils.trim', 'String');
grunt> B = FOREACH A GENERATE trim(fruit);
grunt> DUMP B;
The InvokeForString invoker is used since the return type of the method is a
String. (There are also InvokeForInt, InvokeForLong, InvokeForDouble, and
InvokeForFloat invokers.) The first argument to the invoker constructor is the
fully-qualified method to be invoked. The second is a space-separated list of the
method argument classes.
Data Processing Operators
Loading and Storing Data
We have seen how to load data
from external storage for
processing in Pig. Here’s an
example of using PigStorage to
store tuples as plain-text values
separated by a colon character:
grunt> STORE A INTO 'out'
USING PigStorage(':');
grunt> cat out
Filtering Data
Once we have some data loaded into a relation, the
next step is often to filter it to remove the data that
we are not interested in. By filtering early in the
processing pipeline, we minimize the amount of
data flowing through the system, which can improve
efficiency. The FOREACH...GENERATE operator is
used to act on every row in a relation. It can be used
to remove fields or to generate new ones.
In this example, we do both:
grunt> DUMP A;
grunt> B = FOREACH A GENERATE $0, $2+1,
grunt> DUMP B;
The STREAM operator allows us to transform data in a relation using an external
program or script. It is named by analogy with Hadoop Streaming, which provides a
similar capability for MapReduce.
STREAM can use built-in commands with arguments. Here is an example that
uses the Unix cut command to extract the second field of each tuple in A. Note that
the command and its arguments are enclosed in backticks:
grunt> C = STREAM A THROUGH `cut -f 2`;
grunt> DUMP C;
The STREAM operator uses PigStorage to serialize and deserialize relations to
and from the program’s standard input and output streams. Tuples in A are
converted to tab-delimited lines that are passed to the script. The output of the
script is read one line at a time and split on tabs to create new tuples for the output
relation C.
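The round trip described above (tuples to tab-delimited lines, through a program, back to tuples) can be sketched in Python. The stream_through and cut_f2 functions are hypothetical; cut_f2 is an in-process stand-in for `cut -f 2`, and the contents of A are made up.

```python
def stream_through(relation, program):
    """Serialize tuples as tab-delimited lines, run program on them, parse its output lines."""
    lines_in = ["\t".join(str(field) for field in t) for t in relation]
    lines_out = program(lines_in)
    return [tuple(line.split("\t")) for line in lines_out]

def cut_f2(lines):
    """In-process stand-in for `cut -f 2`: keep the second tab-separated field."""
    return [line.split("\t")[1] for line in lines]

A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
C = stream_through(A, cut_f2)
print(C)
```

Note that every field comes back as text, since the external program only ever sees lines of characters.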
Grouping and Joining Data
Pig has very good built-in support for join operations. Because the large datasets that are
suitable for analysis by Pig (and MapReduce in general) are not normalized, joins are
used less frequently in Pig than they are in SQL.
JOIN - Let’s look at an example of an inner join. Consider the relations
A and B:
grunt> DUMP A;
grunt> DUMP B;
We can join the two relations on the numerical (identity) field in each:
grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
Pig also supports outer joins using a syntax that is similar to SQL's:
grunt> C = JOIN A BY $0 LEFT OUTER, B BY $1;
grunt> DUMP C;
JOIN always gives a flat structure: a set of tuples. The COGROUP statement is
similar to JOIN, but creates a nested set of output tuples. This can be useful if you
want to exploit the structure in subsequent statements:
grunt> D = COGROUP A BY $0, B BY $1;
grunt> DUMP D;
COGROUP generates a tuple for each unique grouping key. The first field of each
tuple is the key, and the remaining fields are bags of tuples from the relations with a
matching key. The first bag contains the matching tuples from relation A with the
same key. Similarly, the second bag contains the matching tuples from relation B
with the same key.
If for a particular key a relation has no matching key, then the bag for that relation is
empty. For example, since no one has bought a scarf (with ID 1), the second bag in the
tuple for that row is empty. This is an example of an outer join, which is the default type
for COGROUP. It can be made explicit using the OUTER keyword, making this
COGROUP statement the same as the previous one:
grunt> D = COGROUP A BY $0 OUTER, B BY $1 OUTER;
You can suppress rows with empty bags by using the INNER keyword, which gives the
COGROUP inner join semantics. The INNER keyword is applied per relation, so the
following only suppresses rows when relation A has no match (dropping the unknown
product 0 here):
grunt> E = COGROUP A BY $0 INNER, B BY $1;
grunt> DUMP E;
We can flatten this structure to discover who bought each of the items in relation A:
grunt> DUMP F;
Combination of COGROUP, INNER, and FLATTEN
Using a combination of COGROUP, INNER, and FLATTEN (which removes
nesting) it’s possible to simulate an (inner) JOIN:
grunt> G = COGROUP A BY $0 INNER, B BY $1 INNER;
grunt> DUMP H;
This gives the same result as JOIN A BY $0, B BY $1.
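A Python model of COGROUP with optional INNER flags, followed by the flatten step that turns it into an inner join. The cogroup function is an illustrative sketch, and the sample relations are assumed: A holds (id, item) and B holds (buyer, item id), chosen so that nobody bought the scarf (ID 1) and one purchase refers to the unknown product 0.

```python
def cogroup(a, a_key, b, b_key, a_inner=False, b_inner=False):
    """Return {key: (bag_a, bag_b)}; INNER on a relation drops keys where its bag is empty."""
    keys = {t[a_key] for t in a} | {t[b_key] for t in b}
    result = {}
    for k in keys:
        bag_a = [t for t in a if t[a_key] == k]
        bag_b = [t for t in b if t[b_key] == k]
        if (a_inner and not bag_a) or (b_inner and not bag_b):
            continue
        result[k] = (bag_a, bag_b)
    return result

A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]

D = cogroup(A, 0, B, 1)                              # outer semantics by default
G = cogroup(A, 0, B, 1, a_inner=True, b_inner=True)  # INNER on both relations

# Flattening both bags pairs every A tuple with every B tuple for the same key
H = sorted(ta + tb for bag_a, bag_b in G.values() for ta in bag_a for tb in bag_b)
print(D[1], H)
```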
Pig Latin includes the cross-product operator (also known as the cartesian product),
which joins every tuple in a relation with every tuple in a second relation (and with
every tuple in further relations if supplied). The size of the output is the product of the
size of the inputs, potentially making the output very large:
grunt> I = CROSS A, B;
grunt> DUMP I;
Whereas COGROUP groups the data in two or more relations, the GROUP statement
groups the data in a single relation. GROUP supports grouping by more than
equality of keys: you can use an expression or user-defined function as the
group key. For example, consider the following relation A:
grunt> DUMP A;
Let’s group by the number of characters in the second field:
grunt> B = GROUP A BY SIZE($1);
grunt> DUMP B;
GROUP creates a relation whose first field is the grouping field, which is
given the alias group. The second field is a bag containing the grouped fields
with the same schema as the original relation (in this case, A).
There are also two special grouping operations: ALL and ANY. ALL
groups all the tuples in a relation in a single group, as if the GROUP
function was a constant:
grunt> C = GROUP A ALL;
grunt> DUMP C;
Note that there is no BY in this form of the GROUP statement. The ALL
grouping is commonly used to count the number of tuples in a relation.
The ANY keyword is used to group the tuples in a relation randomly,
which can be useful for sampling.
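Grouping by an expression and grouping with ALL can both be viewed as grouping by a key function, as this Python sketch suggests. The group_by helper and the contents of A are assumptions for illustration.

```python
from collections import defaultdict

def group_by(relation, key_fn):
    """Map each key to the bag of whole tuples that produce it."""
    groups = defaultdict(list)
    for t in relation:
        groups[key_fn(t)].append(t)
    return dict(groups)

A = [("Joe", "cherry"), ("Ali", "apple"), ("Joe", "banana"), ("Eve", "apple")]

B = group_by(A, lambda t: len(t[1]))  # B = GROUP A BY SIZE($1);
C = group_by(A, lambda t: "all")      # C = GROUP A ALL;  (a constant key)
count = len(C["all"])                 # the row-counting idiom

print(B, count)
```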
Sorting Data
Relations are unordered in Pig. Consider a relation A:
grunt> DUMP A;
There is no guarantee which order the rows will be processed in. In particular, when
retrieving the contents of A using DUMP or STORE, the rows may be written in any
order. If we want to impose an order on the output, we can use the ORDER operator to
sort a relation by one or more fields. The default sort order compares fields of the
same type using the natural ordering, and different types are given an arbitrary, but
deterministic, ordering (a tuple is always “less than” a bag, for example). The following
example sorts A by the first field in ascending order and by the second field in
descending order:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
Any further processing on a sorted relation is not guaranteed to retain its order. For
example: grunt> C = FOREACH B GENERATE *;
Even though relation C has the same contents as relation B, its tuples may be emitted
in any order by a DUMP or a STORE. It is for this reason that it is usual to perform the
ORDER operation just before retrieving the output.
The LIMIT statement is useful for limiting the number of results, as a quick and dirty way
to get a sample of a relation. It can be used immediately after the ORDER statement to
retrieve the first n tuples. Usually, LIMIT will select any n tuples from a relation, but when
used immediately after an ORDER statement, the order is retained (in an exception to
the rule that processing a relation does not retain its order):
grunt> D = LIMIT B 2;
grunt> DUMP D;
If the limit is greater than the number of tuples in the relation, all tuples are returned
(so LIMIT has no effect).
Using LIMIT can improve the performance of a query because Pig tries to apply the
limit as early as possible in the processing pipeline, to minimize the amount of data
that needs to be processed.
For this reason, we should always use LIMIT if we are not interested in the entire output.
Combining and Splitting Data
Sometimes you have several
relations that you would like to
combine into one. For this, the
UNION statement is used. For example:
grunt> DUMP A;
grunt> DUMP B;
grunt> C = UNION A, B;
grunt> DUMP C;
C is the union of relations A and B, and since
relations are unordered, the order of the tuples
in C is undefined. Also, it’s possible to form the
union of two relations with different schemas or
with different numbers of fields, as we have
done here. Pig attempts to merge the schemas
from the relations that UNION is operating on.
In this case, they are incompatible, so C has
no schema:
grunt> DESCRIBE A;
A: {f0: int,f1: int}
grunt> DESCRIBE B;
B: {f0: chararray,f1: chararray,f2: int}
grunt> DESCRIBE C;
Schema for C unknown.
If the output relation has no schema, your
script needs to be able to handle tuples that
vary in the number of fields and/or types.
The SPLIT operator is the opposite of UNION; it partitions a relation into two or more relations.
•When running in MapReduce mode it’s important that the degree of parallelism
matches the size of the dataset.
•By default, Pig sets the number of reducers by looking at the size of the input,
and using one reducer per 1GB of input, up to a maximum of 999 reducers.
•We can override these parameters by setting pig.exec.reducers.bytes.per.reducer
(the default is 1000000000 bytes) and pig.exec.reducers.max (default 999).
•To explicitly set the number of reducers we want for each job, we can use a
PARALLEL clause for operators that run in the reduce phase. These include all the
grouping and joining operators (GROUP, COGROUP, JOIN, CROSS), as well as
DISTINCT and ORDER.
•The following line sets the number of reducers to 30 for the GROUP:
grouped_records = GROUP records BY year PARALLEL 30;
Alternatively, we can set the default_parallel option, and it will take effect for all
subsequent jobs:
grunt> set default_parallel 30
A good setting for the number of reduce tasks is slightly fewer than the number of
reduce slots in the cluster. The number of map tasks is set by the size of the input
(with one map per HDFS block) and is not affected by the PARALLEL clause.
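The default reducer estimate described above amounts to a small formula. The Python sketch below assumes the estimate rounds up and never drops below one reducer; the function name is illustrative, and the defaults mirror the two properties mentioned earlier.

```python
import math

def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """One reducer per bytes_per_reducer of input, at least 1, capped at max_reducers."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))

print(estimate_reducers(2_500_000_000))      # 2.5 GB of input
print(estimate_reducers(5_000_000_000_000))  # 5 TB of input, hits the cap
```

A PARALLEL clause or default_parallel setting bypasses this estimate entirely.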
Parameter Substitution
•If we have a Pig script that we run on a regular basis, then it’s quite common to
want to be able to run the same script with different parameters. For example, a
script that runs daily may use the date to determine which input files it runs over.
•Pig supports parameter substitution, where parameters in the script are substituted
with values supplied at runtime.
•Parameters are denoted by identifiers prefixed with a $ character; for example,
$input and $output are used in the following script to specify the input and output paths:
-- max_temp_param.pig
records = LOAD '$input' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
STORE max_temp into '$output';
Parameters can be specified when launching Pig, using the -param option, one for
each parameter:
% pig -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
> -param output=/tmp/out \
> ch11/src/main/pig/max_temp_param.pig
Parameter Substitution
We can also put parameters in a file and pass them to Pig using the -param_file
option. For example, we can achieve the same result as the previous command by
placing the parameter definitions in a file:
# Input file
input=/user/tom/input/ncdc/micro-tab/sample.txt
# Output file
output=/tmp/out
The pig invocation then becomes:
% pig -param_file ch11/src/main/pig/max_temp_param.param \
> ch11/src/main/pig/max_temp_param.pig
We can specify multiple parameter files using -param_file repeatedly. We can also use
a combination of -param and -param_file options, and if any parameter is defined in
both a parameter file and on the command line, the last value on the command line
takes precedence.
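The precedence rule can be sketched in Python as a merge in which later sources overwrite earlier ones. This is a simplified model, not Pig's preprocessor: the function names are made up, and leaving unknown placeholders untouched is a simplification of the sketch.

```python
from string import Template

def resolve_params(param_files, cli_params):
    """Merge parameter sources: files in order, then command-line values; the later value wins."""
    params = {}
    for file_params in param_files:
        params.update(file_params)
    params.update(cli_params)
    return params

def substitute(script, params):
    """Replace $name placeholders; unknown names and positional $0, $1 are left alone."""
    return Template(script).safe_substitute(params)

script = "records = LOAD '$input'; STORE records INTO '$output';"
params = resolve_params(
    [{"input": "/user/tom/input/ncdc/micro-tab/sample.txt", "output": "/tmp/out"}],
    {"output": "/tmp/override"},
)
print(substitute(script, params))
```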
Dynamic parameters
For parameters that are supplied using the -param option, it is easy to make the value
dynamic by running a command or script. Many Unix shells support command
substitution for a command enclosed in backticks, and we can use this to make the
output directory date-based:
% pig -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
> -param output=/tmp/`date "+%Y-%m-%d"`/out \
> ch11/src/main/pig/max_temp_param.pig

Pig - A Data Flow Language and Execution Environment for Exploring Very Large Datasets

  • 1. Pig - A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
      • Pig is made up of two pieces:
          • The language used to express data flows, called Pig Latin.
          • The execution environment to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster.
      • Pig Latin: the best of both SQL and MapReduce, combining high-level declarative querying with low-level procedural programming.
  • 2. Pig Latin Example
      Table urls: (url, category, pagerank)
      Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category.
      SQL:
          SELECT category, AVG(pagerank)
          FROM urls WHERE pagerank > 0.2
          GROUP BY category HAVING COUNT(*) > 10^6
      Pig Latin:
          good_urls = FILTER urls BY pagerank > 0.2;
          groups = GROUP good_urls BY category;
          big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
          output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
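
To make the step-by-step nature of the Pig Latin version concrete, here is a small Python model of the same data flow. It is an illustrative sketch only: the sample rows are invented, and MIN_GROUP_SIZE is a toy stand-in for the 10^6 threshold so that the example produces output on four rows.

```python
from collections import defaultdict

# Hypothetical sample rows: (url, category, pagerank)
urls = [
    ("a.com", "news", 0.75),
    ("b.com", "news", 0.25),
    ("c.com", "news", 0.125),
    ("d.com", "blogs", 0.5),
]
MIN_GROUP_SIZE = 1  # toy stand-in for the 10^6 threshold on the slide

# good_urls = FILTER urls BY pagerank > 0.2;
good_urls = [row for row in urls if row[2] > 0.2]

# groups = GROUP good_urls BY category;
groups = defaultdict(list)
for row in good_urls:
    groups[row[1]].append(row)

# big_groups = FILTER groups BY COUNT(good_urls) > threshold;
big_groups = {cat: rows for cat, rows in groups.items() if len(rows) > MIN_GROUP_SIZE}

# output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
output = {cat: sum(r[2] for r in rows) / len(rows) for cat, rows in big_groups.items()}
print(output)  # {'news': 0.5}
```

Each assignment corresponds to one Pig Latin statement, which is the sense in which Pig Latin is a data flow language: every step names an intermediate relation that the next step consumes.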
  • 3. Differences between Pig Latin and RDBMS
      Pig Latin
          • Pig Latin is a data flow programming language.
          • A Pig Latin program is a step-by-step set of operations on an input relation, in which each step is a single transformation.
          • It will operate on any source of tuples (the source should support being read in parallel, by being in multiple files, for example), where a UDF is used to read the tuples from their raw representation. The most common representation is a text file with tab-separated fields, and Pig provides a built-in load function for this format. We can define a schema at runtime, but it’s optional.
          • There is no data import process to load the data. The data is loaded from the filesystem (usually HDFS) as the first step in the processing.
      RDBMS
          • SQL is a declarative programming language.
          • SQL statements are a set of constraints that, taken together, define the output.
          • RDBMSs store data in tables, with tightly predefined schemas.
          • There is a data import process to load the data into the RDBMS.
  • 4. Differences between Pig Latin and RDBMS
      Pig Latin
          • Pig supports complex, nested data structures.
          • Pig does not support random reads or queries in the order of tens of milliseconds. Nor does it support random writes to update small portions of data; all writes are bulk, streaming writes, just like MapReduce.
          • Pig’s ability to use UDFs and streaming operators that are tightly integrated with the language, together with Pig’s nested data structures, makes Pig Latin more customizable than most SQL dialects.
          • Pig does not have features to support online, low-latency queries, such as transactions and indexes.
      RDBMS
          • Operates on flatter data structures.
          • Supports random reads or queries.
          • Not very customizable.
          • RDBMSs have features to support online, low-latency queries, such as transactions and indexes.
  • 5. Pig Latin (The Language): Data Structures, Commands. Pig (The Compiler): Logical & Physical Plans, Optimization, Efficiency. Grunt (The Interpreter). Pig Pen (The Debugger).
  • 7. Data Model
      • Atom - simple atomic value (e.g., a number or string)
      • Tuple
      • Bag
      • Map
  • 8. Data Model
      • Atom
      • Tuple - sequence of fields; each field can be of any type
      • Bag
      • Map
  • 9. Data Model
      • Atom
      • Tuple
      • Bag - collection of tuples
          • Duplicates possible
          • Tuples in a bag can have different field lengths and field types
      • Map
  • 10. Data Model
      • Atom
      • Tuple
      • Bag
      • Map - collection of key-value pairs
          • Key is an atom; value can be any type
  • 11. Data Model
      • Control over dataflow
          Ex 1 (less efficient):
              spam_urls = FILTER urls BY isSpam(url);
              culprit_urls = FILTER spam_urls BY pagerank > 0.8;
          Ex 2 (more efficient):
              highpgr_urls = FILTER urls BY pagerank > 0.8;
              spam_urls = FILTER highpgr_urls BY isSpam(url);
      • Fully nested
          • More natural for procedural programmers (target user) than normalization
          • Data is often stored on disk in a nested fashion
          • Facilitates ease of writing user-defined functions
      • No schema required
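
The efficiency difference between the two orderings comes from how many rows reach the expensive UDF. The Python sketch below illustrates this under stated assumptions: is_spam is a hypothetical stand-in for the isSpam UDF (here just a substring test with a call counter), and the four sample rows are invented.

```python
def is_spam(url, counter):
    """Hypothetical stand-in for the isSpam UDF; counts how often it is invoked."""
    counter["calls"] += 1
    return "spam" in url


# Hypothetical rows: (url, pagerank)
urls = [("spam1.com", 0.9), ("ok.com", 0.95), ("spam2.com", 0.1), ("fine.com", 0.2)]


def udf_first(rows):
    """Ex 1: run the UDF on every row, then apply the cheap pagerank test."""
    counter = {"calls": 0}
    spam_urls = [r for r in rows if is_spam(r[0], counter)]
    culprit_urls = [r for r in spam_urls if r[1] > 0.8]
    return culprit_urls, counter["calls"]


def cheap_first(rows):
    """Ex 2: apply the cheap pagerank test first, then run the UDF on the survivors."""
    counter = {"calls": 0}
    highpgr_urls = [r for r in rows if r[1] > 0.8]
    spam_urls = [r for r in highpgr_urls if is_spam(r[0], counter)]
    return spam_urls, counter["calls"]


print(udf_first(urls))    # ([('spam1.com', 0.9)], 4)
print(cheap_first(urls))  # ([('spam1.com', 0.9)], 2)
```

Both orderings return the same rows; only the number of UDF invocations differs, which is the control over dataflow the slide refers to.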
  • 12. Data Model
      • User-Defined Functions (UDFs)
          • Can be used in many Pig Latin statements
          • Useful for custom processing tasks
          • Can use non-atomic values for input and output
          • Currently must be written in Java
          • Ex: spam_urls = FILTER urls BY isSpam(url);
  • 13. Compilation
      • The Pig system does two tasks:
          • Builds a Logical Plan from a Pig Latin script
              • Supports execution platform independence
              • No processing of data is performed at this stage
          • Compiles the Logical Plan to a Physical Plan and executes it
              • Converts the Logical Plan into a series of Map-Reduce statements to be executed (in this case) by Hadoop Map-Reduce
  • 14. Compilation
      • Building a Logical Plan
          • Verify that the input files and bags referred to are valid
          • Create a logical plan for each bag (variable) defined
  • 15. Compilation
      • Building a Logical Plan Example
              A = LOAD 'user.dat' AS (name, age, city);
              B = GROUP A BY city;
              C = FOREACH B GENERATE group AS city, COUNT(A);
              D = FILTER C BY city == 'kitchener' OR city == 'waterloo';
              STORE D INTO 'local_user_count.dat';
          Plan so far: Load(user.dat)
  • 16. Compilation
      • Building a Logical Plan Example (same script)
          Plan so far: Load(user.dat) -> Group
  • 17. Compilation
      • Building a Logical Plan Example (same script)
          Plan so far: Load(user.dat) -> Group -> Foreach
  • 18. Compilation
      • Building a Logical Plan Example (same script)
          Plan so far: Load(user.dat) -> Group -> Foreach -> Filter
  • 19. Compilation
      • Building a Logical Plan Example (same script)
          Final plan: Load(user.dat) -> Filter -> Group -> Foreach (the Filter now precedes the Group)
  • 20. Compilation
      • Building a Physical Plan (for the script from the logical plan example)
          • Only happens when output is specified by STORE or DUMP
          Logical plan: Load(user.dat) -> Filter -> Group -> Foreach
  • 21. Compilation
      • Building a Physical Plan
          • Step 1: Create a map-reduce job for each COGROUP
          Figure: the plan Load(user.dat) -> Filter -> Group -> Foreach, divided into a Map phase and a Reduce phase
  • 22. Compilation
      • Building a Physical Plan
          • Step 1: Create a map-reduce job for each COGROUP
          • Step 2: Push other commands into the map and reduce functions where possible
              • Certain commands may require their own map-reduce job (e.g., ORDER needs separate map-reduce jobs)
          Figure: the plan Load(user.dat) -> Filter -> Group -> Foreach, divided into a Map phase and a Reduce phase
  • 23. Syntax and Semantics of Pig Latin
      • Structure
          • A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation or a command, e.g.:
                grouped_records = GROUP records BY year;
                ls /
          • Statements are usually terminated with a semicolon.
          • Pig Latin has two forms of comments. Double hyphens introduce single-line comments:
                -- My program
                DUMP A; -- What's in A?
          • C-style comments are more flexible since they delimit the beginning and end of the comment block with /* and */ markers. They can span lines or be embedded in a single line:
                /*
                 * Description of my program spanning
                 * multiple lines.
                 */
                A = LOAD 'input/pig/join/A';
                B = LOAD 'input/pig/join/B';
                C = JOIN A BY $0, /* ignored */ B BY $1;
                DUMP C;
  • 24. Syntax and Semantics of Pig Latin
      Keywords
      Pig Latin has a list of keywords that have a special meaning in the language and cannot be used as identifiers. These include the operators (LOAD, ILLUSTRATE), commands (cat, ls), expressions (matches, FLATTEN), and functions (DIFF, MAX).
      Pig Latin has mixed rules on case sensitivity. Operators and commands are not case-sensitive (to make interactive use more forgiving); aliases and function names are case-sensitive.
  • 25. Syntax and Semantics of Pig Latin
          -- max_temp.pig: Finds the maximum temperature by year
          records = LOAD 'input/ncdc/micro-tab/sample.txt'
              AS (year:chararray, temperature:int, quality:int);
          filtered_records = FILTER records BY temperature != 9999 AND
              (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
          grouped_records = GROUP filtered_records BY year;
          max_temp = FOREACH grouped_records GENERATE group,
              MAX(filtered_records.temperature);
          DUMP max_temp;
  • 26. Syntax and Semantics of Pig Latin
      • When the Pig Latin interpreter sees the first line containing the LOAD statement, it confirms that it is syntactically and semantically correct, and adds it to the logical plan, but it does not load the data from the file (or even check whether the file exists).
      • Pig validates the GROUP and FOREACH...GENERATE statements, and adds them to the logical plan without executing them.
      • The trigger for Pig to start execution is the DUMP statement. At that point, the logical plan is compiled into a physical plan and executed.
      • The physical plan that Pig prepares is a series of MapReduce jobs, which in local mode Pig runs in the local JVM, and in MapReduce mode Pig runs on a Hadoop cluster.
      • We can see the logical and physical plans created by Pig using the EXPLAIN command on a relation (EXPLAIN max_temp; for example).
      • EXPLAIN will also show the MapReduce plan, which shows how the physical operators are grouped into MapReduce jobs.
      • This is a good way to find out how many MapReduce jobs Pig will run for your query.
  • 27. Multiquery execution
      Since DUMP is a diagnostic tool, it will always trigger execution. However, the STORE command is different. In batch mode, Pig will parse the whole script to see if there are any optimizations that could be made to limit the amount of data to be written to or read from disk. Consider the following simple example:
          A = LOAD 'input/pig/multiquery/A';
          B = FILTER A BY $1 == 'banana';
          C = FILTER A BY $1 != 'banana';
          STORE B INTO 'output/b';
          STORE C INTO 'output/c';
      Relations B and C are both derived from A, so to save reading A twice, Pig can run this script as a single MapReduce job by reading A once and writing two output files from the job, one for each of B and C. This feature is called multiquery execution.
      The physical plan that Pig prepares is a series of MapReduce jobs, which in local mode Pig runs in the local JVM, and in MapReduce mode Pig runs on a Hadoop cluster.
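
The idea behind multiquery execution, reading the input once and routing each row to one of two outputs, can be modeled in a few lines of Python. This is a conceptual sketch rather than a description of how Pig implements it: the function name split_single_pass and the sample rows are hypothetical, and null handling is left out.

```python
def split_single_pass(rows):
    """Read each row once and route it to B (banana) or C (everything else)."""
    b_out, c_out = [], []
    reads = 0
    for row in rows:
        reads += 1  # one read per input row, regardless of how many outputs exist
        if row[1] == "banana":
            b_out.append(row)
        else:
            c_out.append(row)
    return b_out, c_out, reads


# Hypothetical contents of relation A
A = [(1, "banana"), (2, "apple"), (3, "banana"), (4, "cherry")]
B, C, reads = split_single_pass(A)
print(B)      # [(1, 'banana'), (3, 'banana')]
print(C)      # [(2, 'apple'), (4, 'cherry')]
print(reads)  # 4
```

Running the two filters as separate passes would read the four rows twice; the single pass produces the same two outputs with four reads.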
  • 28. Pig Latin relational operators
  • 29. Pig Latin diagnostic operators Pig Latin macro and UDF statements
  • 30. Pig commands to interact with Hadoop filesystems
  • 33. Pig Latin types Pig provides built-in functions TOTUPLE, TOBAG and TOMAP, which are used for turning expressions into tuples, bags and maps.
  • 34. Schemas
      A relation in Pig may have an associated schema, which gives the fields in the relation names and types. The AS clause in a LOAD statement is used to attach a schema to a relation:
          grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
          >> AS (year:int, temperature:int, quality:int);
          grunt> DESCRIBE records;
          records: {year: int,temperature: int,quality: int}
      It’s possible to omit type declarations completely, too:
          grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
          >> AS (year, temperature, quality);
          grunt> DESCRIBE records;
          records: {year: bytearray,temperature: bytearray,quality: bytearray}
      Types can also be declared for only some of the fields:
          grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
          >> AS (year, temperature:int, quality:int);
          grunt> DESCRIBE records;
          records: {year: bytearray,temperature: int,quality: int}
      The schema can be omitted entirely:
          grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt';
          grunt> DESCRIBE records;
          Schema for records unknown.
  • 35. Schemas
      Fields in a relation with no schema can be referenced only using positional notation: $0 refers to the first field in a relation, $1 to the second, and so on. Their types default to bytearray:
          grunt> projected_records = FOREACH records GENERATE $0, $1, $2;
          grunt> DUMP projected_records;
          (1950,0,1)
          (1950,22,1)
          (1950,-11,1)
          (1949,111,1)
          (1949,78,1)
          grunt> DESCRIBE projected_records;
          projected_records: {bytearray,bytearray,bytearray}
  • 36. Validation and nulls
      An SQL database will enforce the constraints in a table’s schema at load time: for example, trying to load a string into a column that is declared to be a numeric type will fail. In Pig, if the value cannot be cast to the type declared in the schema, then it will substitute a null value. Let’s see how this works if we have the following input for the weather data, which has an “e” character in place of an integer:
          1950 0 1
          1950 22 1
          1950 e 1
          1949 111 1
          1949 78 1
      Pig handles the corrupt line by producing a null for the offending value, which is displayed as the absence of a value when dumped to screen (and also when saved using STORE):
          grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
          >> AS (year:chararray, temperature:int, quality:int);
          grunt> DUMP records;
          (1950,0,1)
          (1950,22,1)
          (1950,,1)
          (1949,111,1)
          (1949,78,1)
  • 37. Validation and nulls
      Pig produces a warning for the invalid field (not shown here), but does not halt its processing. For large datasets, it is very common to have corrupt, invalid, or merely unexpected data, and it is generally infeasible to incrementally fix every unparsable record. Instead, we can pull out all of the invalid records in one go, so we can take action on them, perhaps by fixing our program (because they indicate we have made a mistake) or by filtering them out (because the data is genuinely unusable):
          grunt> corrupt_records = FILTER records BY temperature is null;
          grunt> DUMP corrupt_records;
          (1950,,1)
      Note the use of the is null operator, which is analogous to SQL. We can find the number of corrupt records using the following idiom for counting the number of rows in a relation:
          grunt> grouped = GROUP corrupt_records ALL;
          grunt> all_grouped = FOREACH grouped GENERATE group, COUNT(corrupt_records);
          grunt> DUMP all_grouped;
          (all,1)
  • 38. Validation and nulls
      Another useful technique is to use the SPLIT operator to partition the data into “good” and “bad” relations, which can then be analyzed separately:
          grunt> SPLIT records INTO good_records IF temperature is not null,
          >> bad_records IF temperature is null;
          grunt> DUMP good_records;
          (1950,0,1)
          (1950,22,1)
          (1949,111,1)
          (1949,78,1)
          grunt> DUMP bad_records;
          (1950,,1)
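
The cast-to-null behaviour and the filter, count, and split steps from the validation slides can be mimicked in Python. This is a sketch of the semantics only, not of Pig's loader: the helper to_int_or_none is hypothetical, None plays the role of Pig's null, and the input lines are assumed to be tab-separated.

```python
def to_int_or_none(text):
    """Cast to int, substituting None (standing in for Pig's null) on failure."""
    try:
        return int(text)
    except ValueError:
        return None


# The corrupt sample input, assumed tab-separated
raw_lines = ["1950\t0\t1", "1950\t22\t1", "1950\te\t1", "1949\t111\t1", "1949\t78\t1"]

# LOAD ... AS (year:chararray, temperature:int, quality:int)
records = []
for line in raw_lines:
    year, temperature, quality = line.split("\t")
    records.append((year, to_int_or_none(temperature), to_int_or_none(quality)))

# FILTER records BY temperature is null, then count the rows
corrupt_records = [r for r in records if r[1] is None]
corrupt_count = len(corrupt_records)

# SPLIT records INTO good_records IF temperature is not null, bad_records IF temperature is null
good_records = [r for r in records if r[1] is not None]
bad_records = [r for r in records if r[1] is None]

print(corrupt_records)  # [('1950', None, 1)]
print(corrupt_count)    # 1
```

As in the Pig session, the bad value does not stop processing; it simply surfaces as a missing value that later steps can test for.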
  • 39. Validation and nulls
      If temperature’s type was left undeclared, the corrupt data cannot be easily detected, since it doesn’t surface as a null:
          grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
          >> AS (year:chararray, temperature, quality:int);
          grunt> DUMP records;
          (1950,0,1)
          (1950,22,1)
          (1950,e,1)
          (1949,111,1)
          (1949,78,1)
          grunt> filtered_records = FILTER records BY temperature != 9999 AND
          >> (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
          grunt> grouped_records = GROUP filtered_records BY year;
          grunt> max_temp = FOREACH grouped_records GENERATE group,
          >> MAX(filtered_records.temperature);
          grunt> DUMP max_temp;
          (1949,111.0)
          (1950,22.0)
  • 40. Validation and nulls
      What happens in this case is that the temperature field is interpreted as a bytearray, so the corrupt field is not detected when the input is loaded. When passed to the MAX function, the temperature field is cast to a double, since MAX works only with numeric types. The corrupt field cannot be represented as a double, so it becomes a null, which MAX silently ignores. The best approach is generally to declare types for our data on loading, and look for missing or corrupt values in the relations themselves before we do our main processing.
      Sometimes corrupt data shows up as smaller tuples since fields are simply missing. You can filter these out by using the SIZE function as follows:
          grunt> A = LOAD 'input/pig/corrupt/missing_fields';
          grunt> DUMP A;
          (2,Tie)
          (4,Coat)
          (3)
          (1,Scarf)
          grunt> B = FILTER A BY SIZE(TOTUPLE(*)) > 1;
          grunt> DUMP B;
          (2,Tie)
          (4,Coat)
          (1,Scarf)
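
Both effects described on this slide, a maximum that silently skips values it cannot cast and a size test that drops short tuples, can be illustrated with a short Python sketch. The helper names to_double_or_none and max_ignoring_nulls are hypothetical, and the code models the described behaviour rather than Pig's actual MAX implementation.

```python
def to_double_or_none(value):
    """Cast an untyped field to float, yielding None where the cast fails."""
    try:
        return float(value)
    except ValueError:
        return None


def max_ignoring_nulls(values):
    """Return the largest castable value, skipping anything that became None."""
    numbers = [x for x in (to_double_or_none(v) for v in values) if x is not None]
    return max(numbers) if numbers else None


# Untyped temperature fields for 1950, including the corrupt "e"
temps_1950 = ["0", "22", "e"]
print(max_ignoring_nulls(temps_1950))  # 22.0

# Tuples with a missing field, filtered by size as in SIZE(TOTUPLE(*)) > 1
A = [(2, "Tie"), (4, "Coat"), (3,), (1, "Scarf")]
B = [t for t in A if len(t) > 1]
print(B)  # [(2, 'Tie'), (4, 'Coat'), (1, 'Scarf')]
```

The maximum comes out as 22.0 with no sign that a record was skipped, which is why declaring types at load time makes corrupt values easier to find.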
  • 41. Functions
      Eval function
          A function that takes one or more expressions and returns another expression. An example of a built-in eval function is MAX, which returns the maximum value of the entries in a bag. Some eval functions are aggregate functions, which means they operate on a bag of data to produce a scalar value; MAX is an example of an aggregate function. Furthermore, many aggregate functions are algebraic, which means that the result of the function may be calculated incrementally. In MapReduce terms, algebraic functions make use of the combiner and are much more efficient to calculate. MAX is an algebraic function, whereas a function to calculate the median of a collection of values is an example of a function that is not algebraic.
      Filter function
          A special type of eval function that returns a logical boolean result. Filter functions are used in the FILTER operator to remove unwanted rows. They can also be used in other relational operators that take boolean conditions and, in general, expressions using boolean or conditional expressions. An example of a built-in filter function is IsEmpty, which tests whether a bag or a map contains any items.
      Load function
          A function that specifies how to load data into a relation from external storage.
      Store function
          A function that specifies how to save the contents of a relation to external storage. Load and store functions are implemented by the same type. For example, PigStorage, which loads data from delimited text files, can store data in the same format.
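
The distinction between algebraic and non-algebraic aggregates can be checked on a small example. In the Python sketch below, the chunks helper is a hypothetical stand-in for the way input is split across map tasks, and the per-chunk results play the role of combiner output; the sample values are taken from the temperature data shown earlier.

```python
from statistics import median


def chunks(values, size):
    """Split a list into consecutive pieces, standing in for per-task input splits."""
    return [values[i:i + size] for i in range(0, len(values), size)]


values = [0, 22, -11, 111, 78]
parts = chunks(values, 2)  # [[0, 22], [-11, 111], [78]]

# MAX is algebraic: the max of per-chunk maxima equals the overall max.
partial_maxima = [max(p) for p in parts]
combined_max = max(partial_maxima)

# Median is not algebraic: combining per-chunk medians gives a different answer.
partial_medians = [median(p) for p in parts]
combined_median = median(partial_medians)

print(combined_max, max(values))        # 111 111
print(combined_median, median(values))  # 50.0 22
```

Because the partial maxima can be merged safely, each task only has to forward one number per group; a median needs every value to reach the same place before it can be computed.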
  • 44. Macros
      Macros provide a way to package reusable pieces of Pig Latin code from within Pig Latin itself. For example, we can extract the part of our Pig Latin program that performs grouping on a relation and then finds the maximum value in each group, by defining a macro as follows:
          DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
              A = GROUP $X by $group_key;
              $Y = FOREACH A GENERATE group, MAX($X.$max_field);
          };
      The macro, called max_by_group, takes three parameters: a relation, X, and two field names, group_key and max_field. It returns a single relation, Y. Within the macro body, parameters and return aliases are referenced with a $ prefix, such as $X. The macro is used as follows:
          records = LOAD 'input/ncdc/micro-tab/sample.txt'
              AS (year:chararray, temperature:int, quality:int);
          filtered_records = FILTER records BY temperature != 9999 AND
              (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
          max_temp = max_by_group(filtered_records, year, temperature);
          DUMP max_temp;
  • 45. Macros
      At runtime, Pig will expand the macro using the macro definition. After expansion, the program looks like the following; the expanded section is the pair of statements that define macro_max_by_group_A_0 and max_temp:
          records = LOAD 'input/ncdc/micro-tab/sample.txt'
              AS (year:chararray, temperature:int, quality:int);
          filtered_records = FILTER records BY temperature != 9999 AND
              (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
          macro_max_by_group_A_0 = GROUP filtered_records by (year);
          max_temp = FOREACH macro_max_by_group_A_0 GENERATE group,
              MAX(filtered_records.(temperature));
          DUMP max_temp;
      To help reuse, macros can be defined in files separate from Pig scripts, in which case they need to be imported into any script that uses them. An import statement looks like this:
          IMPORT './ch11/src/main/pig/max_temp.macro';
  • 46. User-Defined Functions (Filter UDF, Eval UDF, Load UDF)
      An Eval UDF
      Dynamic Invokers
          grunt> DEFINE trim InvokeForString('org.apache.commons.lang.StringUtils.trim', 'String');
          grunt> B = FOREACH A GENERATE trim(fruit);
          grunt> DUMP B;
          (pomegranate)
          (banana)
          (apple)
          (lychee)
      The InvokeForString invoker is used since the return type of the method is a String. (There are also InvokeForInt, InvokeForLong, InvokeForDouble, and InvokeForFloat invokers.) The first argument to the invoker constructor is the fully-qualified method to be invoked. The second is a space-separated list of the method argument classes.
  • 47. Data Processing Operators
      Loading and Storing Data
          We have seen how to load data from external storage for processing in Pig. Here’s an example of using PigStorage to store tuples as plain-text values separated by a colon character:
              grunt> STORE A INTO 'out' USING PigStorage(':');
              grunt> cat out
              Joe:cherry:2
              Ali:apple:3
              Joe:banana:2
              Eve:apple:7
      Filtering Data
          Once we have some data loaded into a relation, the next step is often to filter it to remove the data that we are not interested in. By filtering early in the processing pipeline, we minimize the amount of data flowing through the system, which can improve efficiency.
          The FOREACH...GENERATE operator is used to act on every row in a relation. It can be used to remove fields or to generate new ones. In this example, we do both:
              grunt> DUMP A;
              (Joe,cherry,2)
              (Ali,apple,3)
              (Joe,banana,2)
              (Eve,apple,7)
              grunt> B = FOREACH A GENERATE $0, $2+1, 'Constant';
              grunt> DUMP B;
              (Joe,3,Constant)
              (Ali,4,Constant)
              (Joe,3,Constant)
              (Eve,8,Constant)
  • 48. STREAM
      The STREAM operator allows us to transform data in a relation using an external program or script. It is named by analogy with Hadoop Streaming, which provides a similar capability for MapReduce. STREAM can use built-in commands with arguments. Here is an example that uses the Unix cut command to extract the second field of each tuple in A. Note that the command and its arguments are enclosed in backticks:
          grunt> C = STREAM A THROUGH `cut -f 2`;
          grunt> DUMP C;
          (cherry)
          (apple)
          (banana)
          (apple)
      The STREAM operator uses PigStorage to serialize and deserialize relations to and from the program’s standard input and output streams. Tuples in A are converted to tab-delimited lines that are passed to the script. The output of the script is read one line at a time and split on tabs to create new tuples for the output relation C.
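
The serialize, transform, and deserialize round trip that STREAM performs can be sketched in Python without calling an external program. In this illustrative model, stream_through and cut_field_2 are hypothetical names; cut_field_2 imitates what cut -f 2 does to tab-delimited lines, and every field comes back as a string because only text crosses the boundary.

```python
def stream_through(relation, program):
    """Serialize tuples to tab-delimited lines, run them through `program`,
    and split each output line on tabs to form the new tuples."""
    lines = ["\t".join(str(field) for field in row) for row in relation]
    output_lines = program(lines)
    return [tuple(line.split("\t")) for line in output_lines]


def cut_field_2(lines):
    """Imitates `cut -f 2`: keep only the second tab-separated field of each line."""
    return [line.split("\t")[1] for line in lines]


A = [("Joe", "cherry", 2), ("Ali", "apple", 3), ("Joe", "banana", 2), ("Eve", "apple", 7)]
C = stream_through(A, cut_field_2)
print(C)  # [('cherry',), ('apple',), ('banana',), ('apple',)]
```

Any callable that maps a list of lines to a list of lines can stand in for the external script here, which mirrors how STREAM treats the program as a black box over text.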
  • 49. Grouping and Joining Data
    Pig has very good built-in support for join operations. Because the large datasets that are suitable for analysis by Pig (and MapReduce in general) are often not normalized, however, joins are used less frequently in Pig than they are in SQL.
    JOIN
    Let’s look at an example of an inner join. Consider the relations A and B:
        grunt> DUMP A;
        (2,Tie)
        (4,Coat)
        (3,Hat)
        (1,Scarf)
        grunt> DUMP B;
        (Joe,2)
        (Hank,4)
        (Ali,0)
        (Eve,3)
        (Hank,2)
    We can join the two relations on the numerical (identity) field in each:
        grunt> C = JOIN A BY $0, B BY $1;
        grunt> DUMP C;
        (2,Tie,Joe,2)
        (2,Tie,Hank,2)
        (3,Hat,Eve,3)
        (4,Coat,Hank,4)
  • 50. Pig also supports outer joins using a syntax that is similar to SQL:
        grunt> C = JOIN A BY $0 LEFT OUTER, B BY $1;
        grunt> DUMP C;
        (1,Scarf,,)
        (2,Tie,Joe,2)
        (2,Tie,Hank,2)
        (3,Hat,Eve,3)
        (4,Coat,Hank,4)
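The difference between the inner and left outer forms can be sketched as a small in-memory hash join in Python. This models the results only; Pig's distributed join strategies are not shown. The function `join` and its `left_outer` flag are hypothetical names, and `None` stands in for the empty (null) fields that DUMP prints as `,,`.

```python
from collections import defaultdict

A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]


def join(left, right, left_key, right_key, left_outer=False):
    """Join two relations on the given field positions."""
    # Index the right-hand relation by its join key.
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    width = len(right[0]) if right else 0
    out = []
    for l in left:
        matches = index.get(l[left_key], [])
        if matches:
            out.extend(l + r for r in matches)
        elif left_outer:
            # No match: keep the left tuple and pad the right side with nulls.
            out.append(l + (None,) * width)
    return out


inner = join(A, B, 0, 1)
outer = join(A, B, 0, 1, left_outer=True)
print(sorted(inner))
print(len(outer))
```

The row order differs from the DUMP output on the slides, which is consistent with relations being unordered.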
  • 51. COGROUP
    JOIN always gives a flat structure: a set of tuples. The COGROUP statement is similar to JOIN, but creates a nested set of output tuples. This can be useful if you want to exploit the structure in subsequent statements:
        grunt> D = COGROUP A BY $0, B BY $1;
        grunt> DUMP D;
        (0,{},{(Ali,0)})
        (1,{(1,Scarf)},{})
        (2,{(2,Tie)},{(Joe,2),(Hank,2)})
        (3,{(3,Hat)},{(Eve,3)})
        (4,{(4,Coat)},{(Hank,4)})
    COGROUP generates a tuple for each unique grouping key. The first field of each tuple is the key, and the remaining fields are bags of tuples from the relations with a matching key. The first bag contains the matching tuples from relation A with the same key. Similarly, the second bag contains the matching tuples from relation B with the same key.
  • 52. COGROUP
    If a relation has no tuples matching a particular key, then the bag for that relation is empty. For example, since no one has bought a scarf (with ID 1), the second bag in the tuple for that row is empty. This is an example of an outer join, which is the default type for COGROUP. It can be made explicit using the OUTER keyword, making this COGROUP statement the same as the previous one:
        D = COGROUP A BY $0 OUTER, B BY $1 OUTER;
    You can suppress rows with empty bags by using the INNER keyword, which gives the COGROUP inner join semantics. The INNER keyword is applied per relation, so the following only suppresses rows when relation A has no match (dropping the unknown product 0 here):
        grunt> E = COGROUP A BY $0 INNER, B BY $1;
        grunt> DUMP E;
        (1,{(1,Scarf)},{})
        (2,{(2,Tie)},{(Joe,2),(Hank,2)})
        (3,{(3,Hat)},{(Eve,3)})
        (4,{(4,Coat)},{(Hank,4)})
    We can flatten this structure to discover who bought each of the items in relation A:
        grunt> F = FOREACH E GENERATE FLATTEN(A), B.$0;
        grunt> DUMP F;
        (1,Scarf,{})
        (2,Tie,{(Joe),(Hank)})
        (3,Hat,{(Eve)})
        (4,Coat,{(Hank)})
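The effect of the per-relation INNER keyword can be sketched in Python. This is a model of the output shape only, with bags represented as lists; the function `cogroup` and its `a_inner` and `b_inner` flags are hypothetical names, not Pig API.

```python
A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]


def cogroup(a, b, a_key, b_key, a_inner=False, b_inner=False):
    """Emit one (key, bag_from_a, bag_from_b) tuple per unique grouping key."""
    keys = sorted({t[a_key] for t in a} | {t[b_key] for t in b})
    out = []
    for k in keys:
        bag_a = [t for t in a if t[a_key] == k]
        bag_b = [t for t in b if t[b_key] == k]
        # INNER on a relation suppresses rows where that relation's bag is empty.
        if (a_inner and not bag_a) or (b_inner and not bag_b):
            continue
        out.append((k, bag_a, bag_b))
    return out


D = cogroup(A, B, 0, 1)                # default: OUTER on both relations
E = cogroup(A, B, 0, 1, a_inner=True)  # COGROUP A BY $0 INNER, B BY $1
for row in E:
    print(row)
```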
  • 53. Combination of COGROUP, INNER, and FLATTEN
    Using a combination of COGROUP, INNER, and FLATTEN (which removes nesting) it’s possible to simulate an (inner) JOIN:
        grunt> G = COGROUP A BY $0 INNER, B BY $1 INNER;
        grunt> H = FOREACH G GENERATE FLATTEN($1), FLATTEN($2);
        grunt> DUMP H;
        (2,Tie,Joe,2)
        (2,Tie,Hank,2)
        (3,Hat,Eve,3)
        (4,Coat,Hank,4)
    This gives the same result as JOIN A BY $0, B BY $1.
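Why flattening both bags reproduces the join can be seen in a short Python sketch: flattening two bags in the same tuple pairs every element of one with every element of the other. The relation G below is written out by hand as the assumed output of the inner COGROUP, and `flatten_both` is a hypothetical helper name.

```python
from itertools import product

# Assumed output of: G = COGROUP A BY $0 INNER, B BY $1 INNER;
G = [
    (2, [(2, "Tie")], [("Joe", 2), ("Hank", 2)]),
    (3, [(3, "Hat")], [("Eve", 3)]),
    (4, [(4, "Coat")], [("Hank", 4)]),
]


def flatten_both(grouped):
    """Model of: FOREACH G GENERATE FLATTEN($1), FLATTEN($2);"""
    # Each group contributes the cross product of its two bags, with nesting removed.
    return [a + b for _, bag_a, bag_b in grouped for a, b in product(bag_a, bag_b)]


H = flatten_both(G)
for row in H:
    print(row)
```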
  • 54. CROSS
    Pig Latin includes the cross-product operator (also known as the cartesian product), which joins every tuple in a relation with every tuple in a second relation (and with every tuple in further relations if supplied). The size of the output is the product of the size of the inputs, potentially making the output very large:
        grunt> I = CROSS A, B;
        grunt> DUMP I;
        (2,Tie,Joe,2)
        (2,Tie,Hank,4)
        (2,Tie,Ali,0)
        (2,Tie,Eve,3)
        (2,Tie,Hank,2)
        (4,Coat,Joe,2)
        (4,Coat,Hank,4)
        (4,Coat,Ali,0)
        (4,Coat,Eve,3)
        (4,Coat,Hank,2)
        (3,Hat,Joe,2)
        (3,Hat,Hank,4)
        (3,Hat,Ali,0)
        (3,Hat,Eve,3)
        (3,Hat,Hank,2)
        (1,Scarf,Joe,2)
        (1,Scarf,Hank,4)
        (1,Scarf,Ali,0)
        (1,Scarf,Eve,3)
        (1,Scarf,Hank,2)
  • 55. GROUP
    Whereas COGROUP groups the data in two or more relations, the GROUP statement groups the data in a single relation. GROUP supports grouping by more than equality of keys: you can use an expression or user-defined function as the group key. For example, consider the following relation A:
        grunt> DUMP A;
        (Joe,cherry)
        (Ali,apple)
        (Joe,banana)
        (Eve,apple)
    Let’s group by the number of characters in the second field:
        grunt> B = GROUP A BY SIZE($1);
        grunt> DUMP B;
        (5,{(Ali,apple),(Eve,apple)})
        (6,{(Joe,cherry),(Joe,banana)})
    GROUP creates a relation whose first field is the grouping field, which is given the alias group. The second field is a bag containing the grouped fields with the same schema as the original relation (in this case, A).
  • 56. GROUP
    There are also two special grouping operations: ALL and ANY. ALL groups all the tuples in a relation in a single group, as if the grouping function were a constant:
        grunt> C = GROUP A ALL;
        grunt> DUMP C;
        (all,{(Joe,cherry),(Ali,apple),(Joe,banana),(Eve,apple)})
    Note that there is no BY in this form of the GROUP statement. The ALL grouping is commonly used to count the number of tuples in a relation.
    The ANY keyword is used to group the tuples in a relation randomly, which can be useful for sampling.
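Grouping by an expression and grouping with ALL can both be modeled as grouping by a key function. The Python below is a sketch of that idea; `group_by` is a hypothetical helper, bags are lists, and the constant key `"all"` simply mirrors the label shown in the DUMP output.

```python
from collections import defaultdict

A = [("Joe", "cherry"), ("Ali", "apple"), ("Joe", "banana"), ("Eve", "apple")]


def group_by(relation, key_fn):
    """Return (group, bag) pairs, one per distinct value of key_fn."""
    groups = defaultdict(list)
    for t in relation:
        groups[key_fn(t)].append(t)
    return sorted(groups.items())


# GROUP A BY SIZE($1): the key is an expression, not a stored field.
B = group_by(A, lambda t: len(t[1]))

# GROUP A ALL: a constant key puts every tuple in a single group,
# which makes counting the relation a matter of measuring one bag.
C = group_by(A, lambda t: "all")
row_count = len(C[0][1])

print(B)
print(row_count)
```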
  • 57. Sorting Data
    Relations are unordered in Pig. Consider a relation A:
        grunt> DUMP A;
        (2,3)
        (1,2)
        (2,4)
    There is no guarantee which order the rows will be processed in. In particular, when retrieving the contents of A using DUMP or STORE, the rows may be written in any order. If we want to impose an order on the output, we can use the ORDER operator to sort a relation by one or more fields. The default sort order compares fields of the same type using the natural ordering, and different types are given an arbitrary, but deterministic, ordering (a tuple is always “less than” a bag, for example).
    The following example sorts A by the first field in ascending order and by the second field in descending order:
        grunt> B = ORDER A BY $0, $1 DESC;
        grunt> DUMP B;
        (1,2)
        (2,4)
        (2,3)
    Any further processing on a sorted relation is not guaranteed to retain its order. For example:
        grunt> C = FOREACH B GENERATE *;
    Even though relation C has the same contents as relation B, its tuples may be emitted in any order by a DUMP or a STORE. It is for this reason that it is usual to perform the ORDER operation just before retrieving the output.
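The mixed ascending and descending sort can be reproduced in Python, shown here in two equivalent ways. This is only a model of the resulting order on an in-memory list; the variable names are illustrative.

```python
A = [(2, 3), (1, 2), (2, 4)]

# Numeric shortcut: negate the field that should sort in descending order.
B = sorted(A, key=lambda t: (t[0], -t[1]))

# General form for any comparable type: sort by the least significant key first,
# then by the most significant key, relying on Python's stable sort.
by_second_desc = sorted(A, key=lambda t: t[1], reverse=True)
B_stable = sorted(by_second_desc, key=lambda t: t[0])

print(B)
print(B_stable)
```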
  • 58. Limit
    The LIMIT statement is useful for limiting the number of results, as a quick and dirty way to get a sample of a relation. It can be used immediately after the ORDER statement to retrieve the first n tuples. Usually, LIMIT will select any n tuples from a relation, but when used immediately after an ORDER statement, the order is retained (in an exception to the rule that processing a relation does not retain its order):
        grunt> D = LIMIT B 2;
        grunt> DUMP D;
        (1,2)
        (2,4)
    If the limit is greater than the number of tuples in the relation, all tuples are returned (so LIMIT has no effect).
    Using LIMIT can improve the performance of a query because Pig tries to apply the limit as early as possible in the processing pipeline, to minimize the amount of data that needs to be processed. For this reason, we should always use LIMIT if we are not interested in the entire output.
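One way to see why a limit applied early saves work: the first n tuples of a sorted relation can be found without sorting everything. The Python sketch below uses `heapq.nsmallest` for that purpose. It illustrates the general idea only and is not a description of how Pig plans the query; `order_then_limit` is a hypothetical name.

```python
import heapq

A = [(2, 3), (1, 2), (2, 4)]


def order_then_limit(relation, n):
    """Model of ORDER ... BY $0, $1 DESC followed by LIMIT n."""
    # nsmallest keeps at most n candidates in memory and returns them in sorted order.
    return heapq.nsmallest(n, relation, key=lambda t: (t[0], -t[1]))


D = order_then_limit(A, 2)
everything = order_then_limit(A, 10)  # limit larger than the relation: all tuples returned
print(D)
print(everything)
```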
  • 59. Combining and Splitting Data
    Sometimes you have several relations that you would like to combine into one. For this, the UNION statement is used. For example:
        grunt> DUMP A;
        (2,3)
        (1,2)
        (2,4)
        grunt> DUMP B;
        (z,x,8)
        (w,y,1)
        grunt> C = UNION A, B;
        grunt> DUMP C;
        (2,3)
        (1,2)
        (2,4)
        (z,x,8)
        (w,y,1)
    C is the union of relations A and B, and since relations are unordered, the order of the tuples in C is undefined. Also, it’s possible to form the union of two relations with different schemas or with different numbers of fields, as we have done here. Pig attempts to merge the schemas from the relations that UNION is operating on. In this case, they are incompatible, so C has no schema:
        grunt> DESCRIBE A;
        A: {f0: int,f1: int}
        grunt> DESCRIBE B;
        B: {f0: chararray,f1: chararray,f2: int}
        grunt> DESCRIBE C;
        Schema for C unknown.
    If the output relation has no schema, your script needs to be able to handle tuples that vary in the number of fields and/or types.
    The SPLIT operator is the opposite of UNION; it partitions a relation into two or more relations.
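A schema-less union means downstream code cannot assume a fixed tuple width or field type. The Python sketch below models UNION as concatenation and SPLIT as a set of named conditions; the `split` helper and the relation names `pairs` and `triples` are hypothetical, chosen only for illustration.

```python
A = [(2, 3), (1, 2), (2, 4)]
B = [("z", "x", 8), ("w", "y", 1)]

# UNION: the output simply contains the tuples of both inputs, whatever their shape.
C = A + B


def split(relation, **conditions):
    """Model of SPLIT: route tuples to each named output whose condition holds."""
    return {name: [t for t in relation if cond(t)] for name, cond in conditions.items()}


# Code consuming C must inspect each tuple, since widths and types vary.
parts = split(C, pairs=lambda t: len(t) == 2, triples=lambda t: len(t) == 3)
print(parts["pairs"])
print(parts["triples"])
```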
  • 60. Parallelism
    • When running in MapReduce mode it’s important that the degree of parallelism matches the size of the dataset.
    • By default, Pig sets the number of reducers by looking at the size of the input, and using one reducer per 1GB of input, up to a maximum of 999 reducers.
    • We can override these parameters by setting pig.exec.reducers.bytes.per.reducer (the default is 1000000000 bytes) and pig.exec.reducers.max (default 999).
    • To explicitly set the number of reducers we want for each job, we can use a PARALLEL clause for operators that run in the reduce phase. These include all the grouping and joining operators (GROUP, COGROUP, JOIN, CROSS), as well as DISTINCT and ORDER.
    • The following line sets the number of reducers to 30 for the GROUP:
        grouped_records = GROUP records BY year PARALLEL 30;
    Alternatively, we can set the default_parallel option, and it will take effect for all subsequent jobs:
        grunt> set default_parallel 30
    A good setting for the number of reduce tasks is slightly fewer than the number of reduce slots in the cluster. The number of map tasks is set by the size of the input (with one map per HDFS block) and is not affected by the PARALLEL clause.
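The default rule (one reducer per 1GB of input, capped at 999) amounts to a small calculation, sketched here in Python. The function name `estimate_reducers` is hypothetical, and rounding up with a minimum of one reducer is an assumption of this sketch rather than something stated on the slide.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """Estimate the reducer count from the input size.

    The defaults mirror pig.exec.reducers.bytes.per.reducer and pig.exec.reducers.max.
    Rounding up and the floor of one reducer are assumptions of this sketch.
    """
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


print(estimate_reducers(2_500_000_000))      # 2.5 GB of input
print(estimate_reducers(5_000_000_000_000))  # 5 TB of input hits the cap
```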
  • 61. Parameter Substitution
    • If we have a Pig script that we run on a regular basis, then it’s quite common to want to run the same script with different parameters. For example, a script that runs daily may use the date to determine which input files it runs over.
    • Pig supports parameter substitution, where parameters in the script are substituted with values supplied at runtime.
    • Parameters are denoted by identifiers prefixed with a $ character; for example, $input and $output are used in the following script to specify the input and output paths:
        -- max_temp_param.pig
        records = LOAD '$input' AS (year:chararray, temperature:int, quality:int);
        filtered_records = FILTER records BY temperature != 9999 AND
          (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
        grouped_records = GROUP filtered_records BY year;
        max_temp = FOREACH grouped_records GENERATE group,
          MAX(filtered_records.temperature);
        STORE max_temp into '$output';
    Parameters can be specified when launching Pig, using the -param option, one for each parameter:
        % pig -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
        >     -param output=/tmp/out \
        >     ch11/src/main/pig/max_temp_param.pig
  • 62. Parameter Substitution
    We can also put parameters in a file and pass them to Pig using the -param_file option. For example, we can achieve the same result as the previous command by placing the parameter definitions in a file:
        # Input file
        input=/user/tom/input/ncdc/micro-tab/sample.txt
        # Output file
        output=/tmp/out
    The pig invocation then becomes:
        % pig -param_file ch11/src/main/pig/max_temp_param.param \
        >     ch11/src/main/pig/max_temp_param.pig
    We can specify multiple parameter files using -param_file repeatedly. We can also use a combination of -param and -param_file options, and if any parameter is defined in both a parameter file and on the command line, the last value on the command line takes precedence.
    Dynamic parameters
    For parameters that are supplied using the -param option, it is easy to make the value dynamic by running a command or script. Many Unix shells support command substitution for a command enclosed in backticks, and we can use this to make the output directory date-based:
        % pig -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
        >     -param output=/tmp/`date "+%Y-%m-%d"`/out \
        >     ch11/src/main/pig/max_temp_param.pig
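The substitution and precedence rules can be modeled in Python with `string.Template`, which happens to use the same `$name` notation. This is a sketch of the behaviour described on the slides, not Pig's preprocessor: the helpers `parse_param_file` and `substitute` are hypothetical, the script is shortened, and `/tmp/override` is a made-up value. Unlike Pig, `string.Template` would reject positional references such as `$0`, so the model only fits scripts that use named parameters.

```python
from string import Template

SCRIPT = (
    "records = LOAD '$input' AS (year:chararray, temperature:int, quality:int);\n"
    "STORE records INTO '$output';\n"
)

PARAM_FILE = """\
# Input file
input=/user/tom/input/ncdc/micro-tab/sample.txt
# Output file
output=/tmp/out
"""


def parse_param_file(text):
    """Read name=value lines, skipping blank lines and # comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        params[name.strip()] = value.strip()
    return params


def substitute(script, file_params, cli_params):
    """Fill in $name placeholders; command-line values override file values."""
    merged = {**file_params, **cli_params}
    # substitute() raises KeyError if a parameter used in the script is undefined.
    return Template(script).substitute(merged)


result = substitute(SCRIPT, parse_param_file(PARAM_FILE), {"output": "/tmp/override"})
print(result)
```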