2. Agenda
What Is Pig? And what is it used for?
Pig Philosophy
Pig’s Data Model
Pig Example
Pig Latin
Pig Latin vs SQL
Pig Macros
Pig UDFs
3. What Is Pig? And what is it used for?
Pig provides an engine that executes data flows in parallel on a cluster, much as
MapReduce distributes map tasks among the cluster nodes to get a job done.
Pig uses its own Pig Latin language for expressing these data flows.
Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System,
HDFS, and Hadoop’s processing system, MapReduce.
By default, Pig reads input files from HDFS, uses HDFS to store intermediate data
between MapReduce jobs, and writes its output to HDFS.
Pig Latin use cases tend to fall into three separate categories: traditional extract,
transform, load (ETL) data pipelines; research on raw data; and iterative
processing.
4. Pig Philosophy
Pigs eat anything
Pigs live anywhere
Pigs are domestic animals
Pigs fly
5. Pig’s Data Model : Types
Pig’s data types can be divided into two categories: scalar types, which
contain a single value, and complex types, which contain other types.
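As a rough analogy only, the split between scalar and complex types can be pictured with Python values. The sketch below lists some of Pig's scalar type names next to a sample Python value, then shows Pig's three complex types (tuple, bag, map) as a Python tuple, list of tuples, and dict. All sample values and the helper is_scalar are hypothetical and are not part of Pig.

```python
# Rough Python analogies for Pig's data types (illustrative only, not how Pig stores data).

# Some of Pig's scalar types: each holds a single value.
scalars = {
    "int": 42,
    "long": 5_000_000_000,
    "float": 3.5,
    "double": 2.718281828,
    "chararray": "hello",
    "bytearray": b"\x00\x01",
    "boolean": True,
}

# Complex types contain other types (all values below are made up).
a_tuple = ("1950", 22, 1)                    # tuple: an ordered set of fields
a_bag = [("1950", 22, 1), ("1950", -11, 1)]  # bag: a collection of tuples
a_map = {"station": "A1", "quality": 1}      # map: chararray keys to values of any type


def is_scalar(value):
    """Hypothetical helper: True when the value is not a container of other values."""
    return not isinstance(value, (tuple, list, dict))
```

The analogy is loose: a Pig bag is unordered, while a Python list keeps insertion order.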
6. Pig’s Data Model : Schemas
If a schema for the data is available, Pig will make use of it, both for up-front error
checking and for optimization.
Syntax: Loads = LOAD 'data.txt' AS (col1:int, col2:chararray, col3:chararray, col4:float);
It is also possible to specify the schema without giving explicit data types. In this case,
the data type is assumed to be bytearray.
Syntax: Loads = LOAD 'data.txt' AS (col1, col2, col3, col4);
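The difference between the two load statements can be sketched in Python. This is a minimal model, not Pig's loader: the function load_line, the sample row, and the tab delimiter are assumptions made for illustration. A field with a declared type is cast, a field with no declared type is kept as raw bytes (in the spirit of bytearray), and a value that cannot be cast becomes a null (None here).

```python
def load_line(line, schema, delimiter="\t"):
    """Split one text line and apply a schema given as (name, type) pairs.

    A type of None models a field declared without a type: the value is
    kept as raw bytes, similar in spirit to Pig's bytearray default.
    """
    casts = {"int": int, "float": float, "chararray": str}
    record = {}
    for (name, type_name), raw in zip(schema, line.split(delimiter)):
        if type_name is None:
            record[name] = raw.encode("utf-8")
        else:
            try:
                record[name] = casts[type_name](raw)
            except ValueError:
                record[name] = None  # a value that cannot be cast becomes null
    return record


typed = [("col1", "int"), ("col2", "chararray"), ("col3", "chararray"), ("col4", "float")]
untyped = [("col1", None), ("col2", None), ("col3", None), ("col4", None)]

line = "7\talpha\tbeta\t1.5"  # hypothetical row, not taken from data.txt
typed_record = load_line(line, typed)
untyped_record = load_line(line, untyped)
```

With the typed schema, col1 arrives as the integer 7; with the untyped schema it stays as the bytes b"7" until something later casts it.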
7. Pig Example
Grunt is Pig’s interactive shell. It enables users to enter Pig Latin interactively and
provides a shell for users to interact with HDFS.
To enter Grunt, run pig -x local, pig -x mapreduce, or pig -x tez, depending on the execution mode.
records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
DUMP max_temp;
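To make each step of the script concrete, here is a plain Python sketch of the same data flow: filter, group by year, then take the maximum temperature per group. The rows are hypothetical and are not the contents of sample.txt. The result is sorted only to make it deterministic; Pig does not promise an output order here.

```python
from collections import defaultdict

# Hypothetical (year, temperature, quality) rows; not the contents of sample.txt.
records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", 9999, 1),  # missing reading, dropped by the filter
    ("1949", 111, 1),
    ("1949", 78, 2),    # quality code not in the accepted list, dropped by the filter
]

# FILTER records BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9)
filtered_records = [
    r for r in records if r[1] != 9999 and r[2] in (0, 1, 4, 5, 9)
]

# GROUP filtered_records BY year: one bag of rows per year
grouped_records = defaultdict(list)
for r in filtered_records:
    grouped_records[r[0]].append(r)

# FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature)
max_temp = sorted(
    (year, max(r[1] for r in bag)) for year, bag in grouped_records.items()
)

# DUMP max_temp
for row in max_temp:
    print(row)
```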
11. Pig Macros
Macros provide a way to package reusable pieces of Pig Latin code from within
Pig Latin itself.
DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
A = GROUP $X by $group_key;
$Y = FOREACH A GENERATE group, MAX($X.$max_field);
};
records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9);
max_temp = max_by_group(filtered_records, year, temperature);
DUMP max_temp;
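The idea of packaging the group-and-max step for reuse can be sketched as an ordinary Python function. This is an analogy under stated assumptions: the rows are hypothetical dicts, and unlike a Python function, which runs when called, a Pig macro is expanded inline into the script before it executes.

```python
def max_by_group(rows, group_key, max_field):
    """Return {group value: maximum of max_field} for rows given as dicts."""
    result = {}
    for row in rows:
        key, value = row[group_key], row[max_field]
        # Keep the largest value seen so far for this group.
        if key not in result or value > result[key]:
            result[key] = value
    return result


# Hypothetical rows standing in for filtered_records.
filtered_records = [
    {"year": "1950", "temperature": 22, "quality": 1},
    {"year": "1950", "temperature": -11, "quality": 1},
    {"year": "1949", "temperature": 111, "quality": 1},
]

# The same helper is reused with different field names, as the macro parameters allow.
max_temp = max_by_group(filtered_records, "year", "temperature")
max_quality = max_by_group(filtered_records, "year", "quality")
```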