What is Pig
Provides abstraction for processing large datasets
A scripting language for exploring large datasets
Made up of two pieces:
• Pig Latin, a language used to express data flows
• The execution environment to run Pig Latin programs
 Local execution in a single JVM
 Distributed execution on Hadoop cluster
Provides much richer data structures, typically multivalued and
nested
A Pig Latin program is made up of a series of operations, or transformations,
that are applied to the input data to produce output
Under the covers, it turns the transformations into a series
of MapReduce jobs
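As a minimal illustrative sketch (the file name, fields, and output path below are assumed, not taken from these slides), a Pig Latin data flow that loads, filters, groups, and stores data might look like this:
-- hypothetical log file with three fields, for illustration only
raw = LOAD 'events.txt' AS (user:chararray, action:chararray, bytes:int);
large = FILTER raw BY bytes > 1000;
by_user = GROUP large BY user;
totals = FOREACH by_user GENERATE group, SUM(large.bytes);
STORE totals INTO 'totals_out';
When this runs on a cluster, Pig compiles the whole flow into one or more MapReduce jobs rather than executing each statement on its own.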
What Pig is NOT
Not suitable for small data sets
In some cases, Pig doesn’t perform as well as
programs written in MapReduce
Installing Pig
Step 1: Switch to the root user
• $ su -
Step 2: Install Pig
• $ yum -y install pig
Step 3: Install HBase
• $ yum -y install hbase
Installing Pig
Step 4: Set environment variables for pig
• $ vi ~/.bashrc
• Add the following lines to the end of the file
 export PIG_CONF_DIR=/usr/lib/pig/conf
 export PIG_CLASSPATH=/usr/lib/hbase/hbase-0.94.6-cdh4.3.0-security.jar:/usr/lib/zookeeper/zookeeper-3.4.5-cdh4.3.0.jar
Step 5: Configure Hadoop Properties
• Add the following in /usr/lib/pig/conf/pig.properties
 fs.default.name=hdfs://localhost/
 mapred.job.tracker=localhost:8021
Pig Execution Types
Local Mode
• Runs in a single JVM and accesses the local filesystem
• Suitable only for small datasets and when trying out Pig
• The execution type is set using the -x or -exectype
option
• To run in local mode:
 $ pig -x local
 This starts Grunt, the Pig interactive shell
Pig Execution Types
MapReduce Mode
• In MapReduce mode, Pig translates queries into
MapReduce jobs and runs them on a Hadoop cluster
• Used when you want to run Pig on large datasets
• Uses the fs.default.name and mapred.job.tracker
properties to connect to Hadoop
• To run in MapReduce Mode:
 $ pig
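Equivalently, the execution type can be given explicitly on the command line:
 $ pig -x mapreduce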
Running Pig Programs
Script
• Run a script file that contains Pig commands
• $ pig script.pig
Grunt
• An interactive shell for running Pig commands
• You can run pig scripts from Grunt using run and exec
Embedded
• Run Pig programs from Java using PigServer class
• For programmatic access to Grunt, use PigRunner
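Picking up the run and exec commands mentioned above, for example (the script name is assumed):
• grunt> exec myscript.pig
• grunt> run myscript.pig
exec runs the script in a separate Grunt context, while run executes it in the current shell, so aliases it defines remain available afterwards.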
Running a sample Pig Job
Step 1: Switch to user joe
• $ su - joe
Step 2: Start the pig grunt command prompt
• $ pig
Step 3: Check whether the directories from the previous
MR job exist
• grunt> ls
• It should display:
 hdfs://localhost:8020/user/joe/input <dir>
 hdfs://localhost:8020/user/joe/output <dir>
Running a sample Pig Job
Step 4: Run the same MR job as a pig job
• grunt> A = LOAD 'input';
• grunt> B = FILTER A BY $0 MATCHES '.*dfs[a-z.]+.*';
• grunt> DUMP B;
Step 5: Check the status of the job
• You can check the status of a job while it is running at
http://<virtual machine ip>:50030
An Example
Copy a sample file in HDFS
• $ su
• $ cat input/ncdc/micro-tab/sample.txt
• $ hadoop fs -copyFromLocal input/ncdc/micro-tab/sample.txt .
Start the Grunt shell
• $ pig
Load the data from sample.txt into a datastore
• grunt> records = LOAD 'sample.txt' AS (year:chararray,
temperature:int, quality:int);
An Example
Filter out invalid records
• grunt> filtered_records = FILTER records BY temperature !=
9999 AND (quality == 0 OR quality == 1 OR quality == 4 OR
quality == 5 OR quality == 9);
Group the filtered records on data column “year”
• grunt> grouped_records = GROUP filtered_records BY
year;
Get the max temperature for each group
• grunt> max_temp = FOREACH grouped_records GENERATE
group,MAX(filtered_records.temperature);
Get the output
• grunt> DUMP max_temp;
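Instead of printing to the console, the result can also be written back to HDFS with STORE (the output path here is assumed):
• grunt> STORE max_temp INTO 'max_temp_output';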
Example Explained
Loading the data from sample.txt creates a
datastore with the following schema:
• records: {year: chararray,temperature: int,quality: int}
Actual data is loaded only when a related
datastore is DUMPed or STOREd
When a filter is applied, a new datastore is created
from the records datastore with the same schema
but with filtered data
• filtered_records: {year: chararray,temperature:
int,quality: int}
Example Explained
When a group is created on a particular data field, a
bag of tuples is created for each unique value of that
field
Each bag contains the complete tuples that share that
value
The size of that bag equals the number of occurrences
of the group value in the original data
The structure of the grouped_records dataset is
• grouped_records: {group: chararray,filtered_records:
{(year: chararray,temperature: int,quality: int)}}
Example Explained
To get an aggregated value from a group, you apply
an aggregate function to one or more of the fields
that are not part of the group key
The structure of the aggregated data set is
• max_temp: {group: chararray,int}
Only when you DUMP max_temp do the actual data
loading, filtering, grouping, and aggregation take
place, as MR jobs
The mappers are used for loading and filtering and
the reducers for grouping and aggregation
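To inspect how Pig plans to translate this data flow into MapReduce before running it, you can ask for the execution plan:
• grunt> EXPLAIN max_temp;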
Example Explained
records:
– (1950,0,1)
– (1950,22,1)
– (1950,-11,1)
– (1949,111,1)
– (1949,78,1)
filtered_records:
– (1950,0,1)
– (1950,22,1)
– (1950,-11,1)
– (1949,111,1)
– (1949,78,1)
Example Explained
grouped_records:
– (1949,{(1949,111,1),(1949,78,1)})
– (1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
max_temp:
– (1949,111)
– (1950,22)
Generating examples
Creating a dataset to test your examples is a pain
Using random datasets doesn’t work as filter and
join operations tend to remove all random data
With ILLUSTRATE, Pig provides a reasonably
complete and concise sample dataset
• grunt> ILLUSTRATE max_temp;
Comparison with Databases
• RDBMS: SQL statements are a set of constraints that, taken together, define the output
 Pig: A Pig Latin program is a step-by-step transformation of the input relation into the output
• RDBMS: Data is stored in tables with tightly predefined schemas
 Pig: You can define the schema at runtime
• RDBMS: SQL operates on flatter data structures
 Pig: Pig Latin supports nested data structures
• RDBMS: Features such as transactions and indexes support online, low-latency queries
 Pig: Such features are absent
Multiquery Execution
STORE and DUMP are different commands, but in
interactive mode they behave the same
DUMP is a diagnostic tool and always triggers
execution
In batch mode, STORE lets Pig defer execution
Pig parses the whole script in batch mode and looks
for optimization opportunities
Multiquery Execution
• $ hadoop fs -copyFromLocal input/pig/multiquery/A .
• $ pig
• grunt> A = LOAD 'A';
• grunt> B = FILTER A BY $1 == 'banana';
• grunt> C = FILTER A BY $1 != 'banana';
• grunt> STORE B INTO 'output/b';
• grunt> STORE C INTO 'output/c';
Relations B and C are both derived from A, so to save
reading A twice, Pig can run this script as a single
MapReduce job by reading A once and writing two
output files from the job, one for each of B and C.
This feature is called multiquery execution
Multiquery Execution
$ hadoop fs -copyFromLocal input/pig/multiquery/A .
$ hadoop fs -mkdir output
$ vi multiquery.pig
Copy the script from previous slide in the file
$ pig multiquery.pig
The output of the MR job will show only 1 mapper
and 0 reducers
Pig Latin Relational Operators
Category: Loading and storing
• LOAD: Loads data from the filesystem or other storage into a relation
• STORE: Saves a relation to the filesystem or other storage
• DUMP: Prints a relation to the console
Category: Filtering
• FILTER: Removes unwanted rows from a relation
• DISTINCT: Removes duplicate rows from a relation
• FOREACH…GENERATE: Adds or removes fields from a relation
• MAPREDUCE: Runs a MapReduce job using a relation as input
• STREAM: Transforms a relation using an external program
• SAMPLE: Selects a random sample of a relation
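As a small sketch reusing the records relation from the earlier weather example, several of these operators can be chained together:
• grunt> filtered = FILTER records BY temperature != 9999;
• grunt> unique_readings = DISTINCT filtered;
• grunt> ten_percent = SAMPLE unique_readings 0.1;
• grunt> DUMP ten_percent;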
Pig Latin Relational Operators
Category: Grouping and joining
• JOIN: Joins two or more relations
• COGROUP: Groups the data in two or more relations
• GROUP: Groups the data in a single relation
• CROSS: Creates the cross-product of two or more relations
Category: Sorting
• ORDER: Sorts a relation by one or more fields
• LIMIT: Limits the size of a relation to a maximum number of tuples
Category: Combining and splitting
• UNION: Combines two or more relations into one
• SPLIT: Splits a relation into two or more relations
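For instance, sorting and truncating the filtered weather records from the earlier example takes one statement each (a sketch, assuming filtered_records still exists):
• grunt> ordered = ORDER filtered_records BY temperature DESC;
• grunt> hottest = LIMIT ordered 2;
• grunt> DUMP hottest;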
Pig Latin Diagnostic Operators
Operator Description
• DESCRIBE: Prints a relation's schema
• EXPLAIN: Prints the logical and physical plans
• ILLUSTRATE: Shows a sample execution of the logical plan, using a generated subset of the input
Pig Latin macro and UDF statements
Operator Description
• REGISTER: Registers a JAR file with the Pig runtime
• DEFINE: Creates an alias for a macro, UDF, streaming script, or command specification
• IMPORT: Imports macros defined in a separate file into a script
Pig Latin Commands
Category: Hadoop Filesystem
• cat: Prints the contents of one or more files
• cd: Changes the current directory
• copyFromLocal: Copies a local file or directory to a Hadoop filesystem
• copyToLocal: Copies a file or directory on a Hadoop filesystem to the local filesystem
• cp: Copies a file or directory to another directory
• fs: Accesses Hadoop's filesystem shell
• ls: Lists files
• mkdir: Creates a new directory
• mv: Moves a file or directory to another directory
• pwd: Prints the path of the current working directory
• rm: Deletes a file or directory
• rmf: Forcibly deletes a file or directory
Pig Latin Commands
Category: Hadoop MapReduce
• kill: Kills a MapReduce job
Category: Utility
• exec: Runs a script in a new Grunt shell in batch mode
• help: Shows the available commands and options
• quit: Exits the interpreter
• run: Runs a script within the existing Grunt shell
• set: Sets Pig options and MapReduce job properties
• sh: Runs a shell command from within Grunt
Options commonly used with set:
• debug: Runs the job in debug mode; the debug level can also be set
• job.name: Gives a Pig job a meaningful name
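For example, job properties can be adjusted from Grunt before triggering execution (a sketch; the job name is arbitrary):
• grunt> set job.name 'NCDC max temperature';
• grunt> set debug on;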
Expressions
• Constant (literal): A constant value. Examples: 1.0, 'a'
• Field by position ($n): Field in position n (zero-based). Example: $0
• Field by name (f): Field named f. Example: year
• Field disambiguation (r::f): Field named f from relation r after grouping or joining. Example: A::year
• Projection (c.$n, c.f): Field in container c (relation, bag, or tuple), by position or by name. Examples: records.$0, records.year
• Map lookup (m#k): Value associated with key k in map m. Example: items#'Coat'
• Cast ((t) f): Cast of field f to type t. Example: (int) year
• Conditional (x ? y : z): Bincond/ternary; y if x evaluates to true, z otherwise. Example: quality == 0 ? 0 : 1
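These expressions are typically used inside FOREACH…GENERATE or FILTER; for instance, a sketch that flags the weather records from earlier (the field name ok is illustrative):
• grunt> flagged = FOREACH records GENERATE year, temperature, (quality == 0 ? 0 : 1) AS ok;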
Expressions
• Arithmetic: x + y, x - y (addition, subtraction); x * y, x / y (multiplication, division); x % y (modulo, the remainder of x divided by y); +x, -x (unary positive, negation). Examples: $1 + $2, $1 * $2, $1 % $2, -1
• Comparison: x == y, x != y (equals, does not equal); x > y, x < y (greater than, less than); x >= y, x <= y (greater or equal to, less or equal to); x matches y (pattern matching with a regular expression); x is null; x is not null. Examples: q == 0, t != 9999, q matches '[01459]', t is null
• Boolean: x or y (logical or); x and y (logical and); not x (logical negation). Examples: q == 0 or q == 1, q == 0 and r == 0, not q matches '[01459]'
• Functional: fn(f1,f2,…) (invocation of function fn on fields f1, f2, etc.). Example: isGood(quality)
Data Types
Category: Numeric
• int: 32-bit signed integer. Example: 1
• long: 64-bit signed integer. Example: 1L
• float: 32-bit floating-point number. Example: 1.0F
• double: 64-bit floating-point number. Example: 1.0
Category: Text
• chararray: Character array in UTF-16 format. Example: 'a'
Category: Binary
• bytearray: Byte array. No literal form
Category: Complex
• tuple: Sequence of fields of any type. Example: (1,'apple')
• bag: An unordered collection of tuples, possibly with duplicates. Example: {(1,'apple'),(2)}
• map: A set of key-value pairs; keys must be character arrays, but values may be any type. Example: ['a'#'apple']
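The complex types can also appear in a LOAD schema; a sketch (the file name and fields below are hypothetical):
• grunt> readings = LOAD 'stations.txt' AS (station:chararray, location:tuple(lat:double, lon:double), temps:bag{t:tuple(temp:int)}, attrs:map[]);
• grunt> DESCRIBE readings;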
Schemas
Defining a complete schema for a dataset is not mandatory
 grunt> records = LOAD 'sample.txt' AS (year:int, temperature:int,
quality:int);
 grunt> DESCRIBE records;
– records: {year: int,temperature: int,quality: int}
 grunt> records = LOAD 'sample.txt' AS (year:chararray,
temperature:int, quality:int);
 grunt> DESCRIBE records;
– records: {year: chararray,temperature: int,quality: int}
 grunt> records = LOAD 'sample.txt' AS (year, temperature:int,
quality);
 grunt> DESCRIBE records;
– records: {year: bytearray,temperature: int,quality: bytearray}
Schemas
Not defining the schema
• grunt> records = LOAD 'sample.txt';
• grunt> describe records;
 Schema for records unknown.
• grunt> dump records;
 (1950,0,1)
 (1950,22,1)
 (1950,-11,1)
 (1949,111,1)
 (1949,78,1)
Schemas
Creating a dataset from a dataset whose schema is
undefined
• grunt> projected_records = FOREACH records GENERATE
$0, $1, $2;
• grunt> DUMP projected_records;
 (1950,0,1)
 (1950,22,1)
 (1950,-11,1)
 (1949,111,1)
 (1949,78,1)
• grunt> DESCRIBE projected_records;
 projected_records: {bytearray,bytearray,bytearray}
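If types are needed later, the projected fields can be cast explicitly (a sketch based on the relation above):
• grunt> typed_records = FOREACH projected_records GENERATE (chararray)$0 AS year, (int)$1 AS temperature, (int)$2 AS quality;
• grunt> DESCRIBE typed_records;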
Validations and nulls
SQL will fail if we try to load a string into a numeric column
In Pig, if the value cannot be cast to the type declared in the
schema, it will substitute a null value
• $ hadoop fs -copyFromLocal input/ncdc/micro-tab/sample_corrupt.txt .
• grunt> records = LOAD 'sample_corrupt.txt' AS (year:chararray,
temperature:int, quality:int);
• grunt> DUMP records;
 (1950,0,1)
 (1950,22,1)
 (1950,,1)
 (1949,111,1)
 (1949,78,1)
Validations and nulls
Filtering records on null values
• grunt> corrupt_records = FILTER records BY temperature is null;
• grunt> DUMP corrupt_records;
 (1950,,1)
• grunt> SPLIT records INTO good_records IF temperature is not
null, bad_records IF temperature is null;
• grunt> dump corrupt_records;
 (1950,,1)
• grunt> DUMP good_records;
 (1950,0,1)
 (1950,22,1)
 (1949,111,1)
 (1949,78,1)
Macros
Provide a way to package reusable pieces of Pig
Latin code
grunt> DEFINE max_by_group(X, group_key, max_field) RETURNS Y
{
A = GROUP $X by $group_key;
$Y = FOREACH A GENERATE group, MAX($X.$max_field);
};
Macros
Using Macro
• grunt> records = LOAD 'sample.txt' AS (year:chararray, temperature:int,
quality:int);
• grunt> max_temp = max_by_group(records, year, temperature);
• grunt> DUMP max_temp;
 (1949,111)
 (1950,22)
Using Normal Pig Constructs
• grunt> macro_max_by_group_A_0 = GROUP records by (year);
• grunt> max_temp = FOREACH macro_max_by_group_A_0 GENERATE group,
MAX(records.(temperature));
• grunt> DUMP max_temp;
 (1949,111)
 (1950,22)
UDF
UDFs are User-Defined Functions
They give you the ability to plug in custom code
We only cover Java UDFs in this section
UDFs can also be written in Python and JavaScript
Types of UDF:
• Filter UDF
• Eval UDF
• Load UDF
Filter UDF
Filter UDFs are UDFs used as filters in Pig
The idea is to change this line
• filtered_records = FILTER records BY temperature != 9999 AND (quality
== 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
To
• filtered_records = FILTER records BY temperature != 9999 AND
isGood(quality);
Filter UDFs are all subclasses of FilterFunc, which itself is a
subclass of EvalFunc.
EvalFunc looks like the following class:
• public abstract class EvalFunc<T> {
public abstract T exec(Tuple input) throws IOException;
}
Filter UDF
EvalFunc’s only abstract method, exec(), takes a
tuple and returns a single value, the
(parameterized) type T.
For FilterFunc, T is Boolean, so the method should
return true only for those tuples that should not be
filtered out.
package org.vmware.training;

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

// Filter UDF: returns true for tuples whose quality field is one of the
// "good" quality codes (0, 1, 4, 5, 9).
public class IsGoodQuality extends FilterFunc {

    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        // An empty or missing tuple cannot be a good record
        if (tuple == null || tuple.size() == 0) {
            return false;
        }
        try {
            Object object = tuple.get(0);
            if (object == null) {
                return false;
            }
            int i = (Integer) object;
            return i == 0 || i == 1 || i == 4 || i == 5 || i == 9;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}
Let's write out the code
$ mkdir ~/UDF
$ mkdir ~/UDF/IsGoodQuality
$ mkdir ~/UDF/IsGoodQuality/IsGoodQualityClasses
$ cd ~/UDF/IsGoodQuality
$ vim IsGoodQuality.java
Copy the code from previous slide into it
Let's compile the code and create a jar
Set the classpath
• $ export CLASSPATH=/usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*:/usr/lib/pig/*:/usr/lib/pig/lib/*
Compile the java class
• $ javac -d IsGoodQualityClasses IsGoodQuality.java
Create a jar
• $ jar -cvf goodquality.jar -C IsGoodQualityClasses/ .
Let's use the UDF in our code
Start the pig console
• $ cd
• $ pig
Register the jar file
• grunt> REGISTER /root/UDF/IsGoodQuality/goodquality.jar;
Use the UDF from the jar
• grunt> records = LOAD 'sample.txt' AS (year:chararray,
temperature:int, quality:int);
• grunt> filtered_records = FILTER records BY temperature !=
9999 AND org.vmware.training.IsGoodQuality(quality);
• grunt> DUMP filtered_records;
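To avoid spelling out the full class name in every statement, the DEFINE operator from the earlier table can give the UDF a short alias (a sketch):
• grunt> DEFINE isGood org.vmware.training.IsGoodQuality();
• grunt> filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);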
Join
Prerequisite
• $ hadoop fs -copyFromLocal /root/input/pig/join/A .
• $ hadoop fs -copyFromLocal /root/input/pig/join/B .
• $ pig
• grunt> A = LOAD 'A' AS (id:int, item:chararray);
• grunt> B = LOAD 'B' AS (name:chararray, id:int);
• grunt> DUMP A;
 (2,Tie)
 (4,Coat)
 (3,Hat)
 (1,Scarf)
• grunt> DUMP B;
 (Joe,2)
 (Hank,4)
 (Ali,0)
 (Eve,3)
 (Hank,2)
Join
Inner Join
• grunt> C = JOIN A BY $0, B BY $1;
• grunt> DUMP C;
 (2,Tie,Joe,2)
 (2,Tie,Hank,2)
 (3,Hat,Eve,3)
 (4,Coat,Hank,4)
Outer Join
• grunt> C = JOIN A BY $0 LEFT OUTER, B BY $1;
 (1,Scarf,,)
 (2,Tie,Joe,2)
 (2,Tie,Hank,2)
 (3,Hat,Eve,3)
 (4,Coat,Hank,4)
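Pig also supports RIGHT OUTER and FULL OUTER joins; a full outer join would in addition keep the unmatched B tuple for Ali, with nulls for the A fields (a sketch):
• grunt> C = JOIN A BY $0 FULL OUTER, B BY $1;
• grunt> DUMP C;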
COGROUP
Similar to JOIN
Creates a nested set of output tuples rather than a
flat structure
• grunt> D = COGROUP A BY $0, B BY $1;
• grunt> DUMP D;
 (0,{},{(Ali,0)})
 (1,{(1,Scarf)},{})
 (2,{(2,Tie)},{(Joe,2),(Hank,2)})
 (3,{(3,Hat)},{(Eve,3)})
 (4,{(4,Coat)},{(Hank,4)})
CROSS
Joins every tuple in a relation with every tuple in a
second relation
• grunt> I = CROSS A, B;
• grunt> DUMP I;
 (2,Tie,Joe,2)
 (2,Tie,Hank,4)
 (2,Tie,Ali,0)
 (2,Tie,Eve,3)
 (2,Tie,Hank,2)
 (4,Coat,Joe,2)
 (4,Coat,Hank,4) ….
STREAM
The STREAM operator allows you to transform data in a relation
using an external program or script
It is named by analogy with Hadoop Streaming, which provides a
similar capability for MapReduce
• grunt> DUMP A;
 (2,Tie)
 (4,Coat)
 (3,Hat)
 (1,Scarf)
• grunt> C = STREAM A THROUGH `cut -f 2`;
• grunt> DUMP C;
 (Tie)
 (Coat)
 (Hat)
 (Scarf)
STREAM
The STREAM operator uses PigStorage to serialize
and deserialize relations to and from the program’s
standard input and output streams
Tuples in A are converted to tab delimited lines
that are passed to the script.
The output of the script is read one line at a time
and split on tabs to create new tuples for the
output relation C.
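For a custom script rather than a standard Unix command, DEFINE can name the command and SHIP the script to the cluster; a sketch, where the script name is assumed:
DEFINE my_filter `my_filter.py` SHIP('my_filter.py');
C = STREAM A THROUGH my_filter;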

More Related Content

What's hot

Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache PigJason Shao
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Event In JavaScript
Event In JavaScriptEvent In JavaScript
Event In JavaScriptShahDhruv21
 
Html text and formatting
Html text and formattingHtml text and formatting
Html text and formattingeShikshak
 
RPL - Routing Protocol for Low Power and Lossy Networks
RPL - Routing Protocol for Low Power and Lossy NetworksRPL - Routing Protocol for Low Power and Lossy Networks
RPL - Routing Protocol for Low Power and Lossy NetworksPradeep Kumar TS
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframeJaemun Jung
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATAGauravBiswas9
 
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Databricks
 
Mongo Nosql CRUD Operations
Mongo Nosql CRUD OperationsMongo Nosql CRUD Operations
Mongo Nosql CRUD Operationsanujaggarwal49
 
Open Source Ajax Solution @OSDC.tw 2009
Open Source Ajax  Solution @OSDC.tw 2009Open Source Ajax  Solution @OSDC.tw 2009
Open Source Ajax Solution @OSDC.tw 2009Robbie Cheng
 

What's hot (20)

Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache Pig
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Channel access link
Channel access linkChannel access link
Channel access link
 
Hadoop data management
Hadoop data managementHadoop data management
Hadoop data management
 
Event In JavaScript
Event In JavaScriptEvent In JavaScript
Event In JavaScript
 
Html text and formatting
Html text and formattingHtml text and formatting
Html text and formatting
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
RPL - Routing Protocol for Low Power and Lossy Networks
RPL - Routing Protocol for Low Power and Lossy NetworksRPL - Routing Protocol for Low Power and Lossy Networks
RPL - Routing Protocol for Low Power and Lossy Networks
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to Pig
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATA
 
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
Mongo Nosql CRUD Operations
Mongo Nosql CRUD OperationsMongo Nosql CRUD Operations
Mongo Nosql CRUD Operations
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Introduction to XHTML
Introduction to XHTMLIntroduction to XHTML
Introduction to XHTML
 
Open Source Ajax Solution @OSDC.tw 2009
Open Source Ajax  Solution @OSDC.tw 2009Open Source Ajax  Solution @OSDC.tw 2009
Open Source Ajax Solution @OSDC.tw 2009
 

Similar to Hadoop Pig

Pig_Presentation
Pig_PresentationPig_Presentation
Pig_PresentationArjun Shah
 
How Secure Are Docker Containers?
How Secure Are Docker Containers?How Secure Are Docker Containers?
How Secure Are Docker Containers?Ben Hall
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsViswanath Gangavaram
 
Mahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud PlatformMahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud PlatformIMC Institute
 
2015-GopherCon-Talk-Uptime.pdf
2015-GopherCon-Talk-Uptime.pdf2015-GopherCon-Talk-Uptime.pdf
2015-GopherCon-Talk-Uptime.pdfUtabeUtabe
 
power point presentation on pig -hadoop framework
power point presentation on pig -hadoop frameworkpower point presentation on pig -hadoop framework
power point presentation on pig -hadoop frameworkbhargavi804095
 
Pyramid Deployment and Maintenance
Pyramid Deployment and MaintenancePyramid Deployment and Maintenance
Pyramid Deployment and MaintenanceJazkarta, Inc.
 
Bluemix hadoop beginners Guide part I
Bluemix hadoop beginners Guide part IBluemix hadoop beginners Guide part I
Bluemix hadoop beginners Guide part IJoseph Chang
 
PigHive presentation and hive impor.pptx
PigHive presentation and hive impor.pptxPigHive presentation and hive impor.pptx
PigHive presentation and hive impor.pptxRahul Borate
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
How to Design a Great API (using flask) [ploneconf2017]
How to Design a Great API (using flask) [ploneconf2017]How to Design a Great API (using flask) [ploneconf2017]
How to Design a Great API (using flask) [ploneconf2017]Devon Bernard
 
Filesystem Management with Flysystem - php[tek] 2023
Filesystem Management with Flysystem - php[tek] 2023Filesystem Management with Flysystem - php[tek] 2023
Filesystem Management with Flysystem - php[tek] 2023Mark Niebergall
 
Plack on SL4A in Yokohama.pm #8
Plack on SL4A in Yokohama.pm #8Plack on SL4A in Yokohama.pm #8
Plack on SL4A in Yokohama.pm #8Yoshiki Kurihara
 

Similar to Hadoop Pig (20)

Hadoop HDFS
Hadoop HDFS Hadoop HDFS
Hadoop HDFS
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
Pig_Presentation
Pig_PresentationPig_Presentation
Pig_Presentation
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
How Secure Are Docker Containers?
How Secure Are Docker Containers?How Secure Are Docker Containers?
How Secure Are Docker Containers?
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
06 pig-01-intro
06 pig-01-intro06 pig-01-intro
06 pig-01-intro
 
Mahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud PlatformMahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud Platform
 
2015-GopherCon-Talk-Uptime.pdf
2015-GopherCon-Talk-Uptime.pdf2015-GopherCon-Talk-Uptime.pdf
2015-GopherCon-Talk-Uptime.pdf
 
power point presentation on pig -hadoop framework
power point presentation on pig -hadoop frameworkpower point presentation on pig -hadoop framework
power point presentation on pig -hadoop framework
 
Pyramid Deployment and Maintenance
Pyramid Deployment and MaintenancePyramid Deployment and Maintenance
Pyramid Deployment and Maintenance
 
Afl - 4fun and pr0fit
Afl - 4fun and pr0fitAfl - 4fun and pr0fit
Afl - 4fun and pr0fit
 
Bluemix hadoop beginners Guide part I
Bluemix hadoop beginners Guide part IBluemix hadoop beginners Guide part I
Bluemix hadoop beginners Guide part I
 
PigHive presentation and hive impor.pptx
PigHive presentation and hive impor.pptxPigHive presentation and hive impor.pptx
PigHive presentation and hive impor.pptx
 
pig intro.pdf
pig intro.pdfpig intro.pdf
pig intro.pdf
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
PigHive.pptx
PigHive.pptxPigHive.pptx
PigHive.pptx
 
How to Design a Great API (using flask) [ploneconf2017]
How to Design a Great API (using flask) [ploneconf2017]How to Design a Great API (using flask) [ploneconf2017]
How to Design a Great API (using flask) [ploneconf2017]
 
Filesystem Management with Flysystem - php[tek] 2023
Filesystem Management with Flysystem - php[tek] 2023Filesystem Management with Flysystem - php[tek] 2023
Filesystem Management with Flysystem - php[tek] 2023
 
Plack on SL4A in Yokohama.pm #8
Plack on SL4A in Yokohama.pm #8Plack on SL4A in Yokohama.pm #8
Plack on SL4A in Yokohama.pm #8
 

Recently uploaded

Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 

Recently uploaded (20)

Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 

Hadoop Pig

  • 1. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 What is Pig Provides abstraction for processing large datasets A scripting language for exploring large datasets Made up of two pieces: • Pig Latin, a language used to express data flows • The execution enviornment to run Pig Latin programs  Local execution in a single JVM  Distributed execution on Hadoop cluster Provides much richer datasets, typically multivalued and nested Made up of series of operations or transformations, that are applied to the input data to produce output Under the covers, it turns the transformations into a series of MapReduce jobs
  • 2. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 What Pig is NOT Not suitable for small data sets In some cases, Pig doesn’t perform as well as programs written in MapReduce
  • 3. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
  • 4. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Installing Pig Step 1: make yourself root user Step 2: Install pig • $ yum –y install pig Step 3: Install Hbase • $ yum –y install hbase
  • 5. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Installing Pig Step 4: Set environment variables for pig • $ vi ~/.bashrc • Add the following lines to the end of the file  export PIG_CONF_DIR=/usr/lib/pig/conf  export PIG_CLASSPATH=/usr/lib/hbase/hbase-0.94.6- cdh4.3.0-security.jar:/usr/lib/zookeeper/zookeeper-3.4.5- cdh4.3.0.jar Step 5: Configure Hadoop Properties • Add the following in /usr/lib/pig/conf/pig.properties  fs.default.name=hdfs://localhost/  mapred.job.tracker=localhost:8021
  • 6. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Pig Execution Types Local Mode • Runs in a single JVM and accesses the local filesystem • Suitable only for small datasets and when trying out Pig • The execution type is set using the -x or -exectype option • To run in local mode:  $ pig -x local  This starts Grunt, the Pig interactive shell
  • 7. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Pig Execution Types MapReduce Mode • In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster • Used when you want to run Pig on large datasets • Uses the fs.default.name and mapred.job.tracker properties to connect to Hadoop • To run in MapReduce Mode:  $ pig
  • 8. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Running Pig Programs Script • Run a script file that contains Pig commands • $ pig script.pig Grunt • An interactive shell for running Pig commands • You can run pig scripts from Grunt using run and exec Embedded • Run Pig programs from Java using PigServer class • For programmatic access to Grunt, use PigRunner
  • 9. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Running a sample Pig Job Step 1: Go to user Joe • $ su - joe Step 2: Start the pig grunt command prompt • $ pig Step 3: Check if the directories from the previous MR job exists • grunt> ls • It should display:  hdfs://localhost:8020/user/joe/input <dir>  hdfs://localhost:8020/user/joe/output <dir>
  • 10. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Running a sample Pig Job Step 4: Run the same MR job as a pig job • grunt> A = LOAD 'input'; • grunt> B = FILTER A BY $0 MATCHES '.*dfs[a-z.]+.*'; • grunt> DUMP B; Step 5: Check the status of the job • You can check the status of a job while it is running at http://<virtual machine ip>:50030
  • 11. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 An Example Copy a sample file in HDFS • $ su • $ cat input/ncdc/micro-tab/sample.txt • $ hadoop fs -copyFromLocal input/ncdc/micro- tab/sample.txt . Start the Grunt shell • $ pig Load the data from sample.txt into a datastore • grunt> records = LOAD 'sample.txt' AS (year:chararray, temperature:int, quality:int);
  • 12. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 An Example Filter out invalid records • grunt> filtered_records = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9); Group the filtered records on data column “year” • grunt> grouped_records = GROUP filtered_records BY year; Get the max temperature for each group • grunt> max_temp = FOREACH grouped_records GENERATE group,MAX(filtered_records.temperature); Get the output • grunt> DUMP max_temp;
  • 13. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Example Explained Loading the data from sample.txt creates a datastore with the following schema: • records: {year: chararray,temperature: int,quality: int} Actual data is loaded only when some related datastore is DUMP’ed or STORE’ed When a filter is applied a new datastore is created from the records detastore with the same schema but with filtered data • filtered_records: {year: chararray,temperature: int,quality: int}
  • 14. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Example Explained When a group is created for a particular data field, for each unique value of that field a list of tuples is created The tuple list contains all data fields grouped together The length of that list is equal to the number of occurrences of the group value in original data The structure of the grouped_record dataset is • grouped_records: {group: chararray,filtered_records: {(year: chararray,temperature: int,quality: int)}}
  • 15. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Example Explained To get some aggregated value from a group you should apply some aggregation on one or more of the fields which were not a part of the group The structure of the aggregated data set is • max_temp: {group: chararray,int} Now when you say DUMP max_temp all the actual data loading, filtering, grouping, aggregation takes place as MR jobs The mappers are used for loading and filtering and the reducers for grouping and aggregation
  • 16. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Example Explained records: – (1950,0,1) – (1950,22,1) – (1950,-11,1) – (1949,111,1) – (1949,78,1) filtered_records: – (1950,0,1) – (1950,22,1) – (1950,-11,1) – (1949,111,1) – (1949,78,1)
  • 17. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Example Explained grouped_records: – (1949,{(1949,111,1),(1949,78,1)}) – (1950,{(1950,0,1),(1950,22,1),(1950,-11,1)}) max_temp: – (1949,111) – (1950,22)
  • 18. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Generating examples Creating a dataset to test your examples is a pain Using random datasets doesn’t work as filter and join operations tend to remove all random data With ILLUSTRATE, Pig provides a reasonably complete and concise sample dataset • grunt> ILLUSTRATE max_temp;
  • 19. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Generating examples
  • 20. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Comparison with Databases Relational Databases Pig SQL statements are a set of constraints that, taken together, define the output Pig Latin program is a step-by-step transformation of input relation to output RDBMS store data in tables, with tightly predefined schemas In Pig you can define the schema at runtime SQL operates on flatter data structures Pig Latin supports nested data structures RDBMS have features to support online, low latency queries, such as transactions and indexes Such features are absent in Pig
  • 21. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Multiquery Execution STORE and DUMP are different commands but act same in interactive mode DUMP is a diagnostic tool and always triggers execution STORE in batch mode triggers execution efficiently Pig parses whole script in batch mode and checks for any optimizations
  • 22. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Multiquery Execution • $ hadoop fs –copyFromLocal input/pig/multiquey/A . • $ pig • grunt> A = LOAD 'A'; • grunt> B = FILTER A BY $1 == 'banana'; • grunt> C = FILTER A BY $1 != 'banana'; • grunt> STORE B INTO 'output/b'; • grunt> STORE C INTO 'output/c'; Relations B and C are both derived from A, so to save reading A twice, Pig can run this script as a single MapReduce job by reading A once and writing two output files from the job, one for each of B and C. This feature is called multiquery execution
  • 23. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Multiquery Execution $ hadoop fs –copyFromLocal input/pig/multiquery/A . $ hadoop fs –mkdir output $ vi multiquery.pig Copy the script from previous slide in the file $ pig multiquery.pig The output of the MR job will show only 1 mapper and 0 reducers
  • 24. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Pig Latin Relational Operators Category Operator Description Loading and Sorting LOAD Load data from filesystem or other storage into a relation STORE Saves a relation to the filesystem or other storage DUMP Prints a relation to the console Filtering FILTER Removes unwanted rows from a relation DISTINCT Removes duplicate rows from a relation FOREACH…GENERATE Adds or removes fields from a relation MAPREDUCE Runs a MapReduce job using a relation as input STREAM Transforms a relation using an external program SAMPLE Selects a random sample of a relation
  • 25. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Pig Latin Relational Operators Category Operator Description Grouping and joining JOIN Joins two or more relations COGROUP Groups the data in two or more relations GROUP Groups the data in a single relation CROSS Creates the cross-product of two or more relations Sorting ORDER Sorts a relation by one or more fields LIMIT Limits the size of a relation to a maximum number of tuples Combining and Splitting UNION Combines two or more relations into one SPLIT Splits a relation into two or more relations
  • 26. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Pig Latin Diagnostic Operators Operator Description DESCRIBE Prints a relations schema EXPLAIN Prints the logical and physical plans ILLUSTRATE Shows a sample execution of the logical plan, using a generated subset of the input
  • 27. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Pig Latin macro and UDF statements Operator Description REGISTER Registers a JAR file with the Pig runtime DEFINE Creates an alias for a macro, UDF, streaming script, or command specification IMPORT Import macros defined in a separate file into a script
  • 28. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Pig Latin Commands Category Command Description Hadoop Filesystem cat Prints the contents of one or more files cd Changes the current directory copyFromLocal Copies a local file or directory to a Hadoop filesystem copyToLocal Copies a file or directory on a Hadoop filesystem to the local filesystem cp Copies a file or directory to another directory fs Accesses Hadoop’s filesystem shell ls Lists files mkdir Creates a new directory mv Moves a file or directory to another directory pwd Prints the path of the current working directory rm Deletes a file or directory rmf Forcibly deletes a file or directory
  • 29. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Pig Latin Commands Category Command Description Hadoop MapReduce Utility kill Kills a MapReduce job exec Runs a script in a new Grunt shell in batch mode help Shows the available commands and options quit Exits the interpreter run Runs a script within the existing Grunt shell set Sets Pig options and MapReduce job properties sh Run a shell command from within Grunt debug Runs the job in debug mode. You can also set the debug level job.name Gives a pig job a meaningful name
  • 30. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Expressions Category Expressions Description Examples Constant Literal Constant value 1.0, 'a' Field (by position) $n Field in position n (zero-based) $0 Field (by name) f Field named f year Field (disambiguate) r::f Field named f from relation r after grouping or joining A::year Projection c.$n, c.f Field in container c (relation, bag, or tuple) by position, by name records.$0, records.year Map lookup m#k Value associated with key k in map m items#'Coat' Cast (t) f Cast of field f to type t (int) year Conditional x ? y : z Bincond/ternary; y if x evaluates to true, z Otherwise quality == 0 ? 0 : 1
  • 31. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Expressions Category Expressions Description Examples Arithmetic x + y, x – y x * y, x / y x % y +x, -x Addition, subtraction Multiplication, division Modulo, the remainder of x divided by y Unary positive, negation $1 + $2, $1 - $2 $1 * $2, $1 / $2 $1 % $2 +1, –1 Comparison x == y, x != y x > y, x < y x >= y, x <= y x matches y x is null x is not null Equals, does not equal Greater than, less than Greater or equal to, less or equal to Pattern matching with regular exprn Is null Is not null q == 0, t != 9999 q > 0, q < 10 q >= 1, q <=9 q matches '[01459]‘ t is null t is not null Boolean x or y x and y not x Logical or Logical and Logical negation q == 0 or q == 1 q == 0 and r == 0 not q matches '[01459]' Functional fn(f1,f2,…) Invocation of function fn on fields f1, f2, etc. isGood(quality)
  • 32. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Data Types Category Type Description Examples Numeric int long float double 32-bit signed integer 64-bit signed integer 32-bit floating-point number 64-bit floating-point number 1 1L 1.0F 1.0 Text chararray Character array in UTF-16 format 'a’ Binary bytearray Byte array Not supported Complex tuple bag map Sequence of fields of any type An unordered collection of tuples, possibly with duplicates A set of key-value pairs; keys must be character arrays, but values may be any type (1,‘apple') {(1,‘apple'),(2)} ['a'#‘apple']
  • 33. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Schemas Defining a right schema for a dataset is not mandatory  grunt> records = LOAD 'sample.txt’ AS (year:int, temperature:int, quality:int);  grunt> DESCRIBE records; – records: {year: int,temperature: int,quality: int}  grunt> records = LOAD 'sample.txt’ AS (year:chararray, temperature:int, quality:int);  grunt> DESCRIBE records; – records: {year: chararray,temperature: int,quality: int}  grunt> records = LOAD 'sample.txt’ AS (year, temperature:int, quality);  grunt> DESCRIBE records; – records: {year: bytearray,temperature: int,quality: bytearray}
  • 34. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
Schemas
Not defining the schema
• grunt> records = LOAD 'sample.txt';
• grunt> describe records;
 Schema for records unknown.
• grunt> dump records;
 (1950,0,1)
 (1950,22,1)
 (1950,-11,1)
 (1949,111,1)
 (1949,78,1)
  • 35. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
Schemas
Creating a dataset from a dataset whose schema is undefined
• grunt> projected_records = FOREACH records GENERATE $0, $1, $2;
• grunt> DUMP projected_records;
 (1950,0,1)
 (1950,22,1)
 (1950,-11,1)
 (1949,111,1)
 (1949,78,1)
• grunt> DESCRIBE projected_records;
 projected_records: {bytearray,bytearray,bytearray}
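Since the projected fields come through as bytearray, they can be cast to more specific types on the way out; a minimal sketch using the same relation:
• grunt> typed_records = FOREACH records GENERATE (chararray)$0 AS year, (int)$1 AS temperature, (int)$2 AS quality;
• grunt> DESCRIBE typed_records;
 typed_records: {year: chararray,temperature: int,quality: int}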
  • 36. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
Validations and nulls
In SQL, loading a string into a numeric column fails
In Pig, if a value cannot be cast to the type declared in the schema, a null value is substituted
• $ hadoop fs -copyFromLocal input/ncdc/micro-tab/sample_corrupt.txt .
• grunt> records = LOAD 'sample_corrupt.txt' AS (year:chararray, temperature:int, quality:int);
• grunt> DUMP records;
 (1950,0,1)
 (1950,22,1)
 (1950,,1)
 (1949,111,1)
 (1949,78,1)
  • 37. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
Validations and nulls
Filtering records on null values
• grunt> corrupt_records = FILTER records BY temperature is null;
• grunt> DUMP corrupt_records;
 (1950,,1)
• grunt> SPLIT records INTO good_records IF temperature is not null, bad_records IF temperature is null;
• grunt> DUMP bad_records;
 (1950,,1)
• grunt> DUMP good_records;
 (1950,0,1)
 (1950,22,1)
 (1949,111,1)
 (1949,78,1)
  • 38. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
Macros
Provide a way to package reusable pieces of Pig Latin code
grunt> DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
  A = GROUP $X by $group_key;
  $Y = FOREACH A GENERATE group, MAX($X.$max_field);
};
  • 39. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
Macros
Using the macro
• grunt> records = LOAD 'sample.txt' AS (year:chararray, temperature:int, quality:int);
• grunt> max_temp = max_by_group(records, year, temperature);
• grunt> DUMP max_temp;
 (1949,111)
 (1950,22)
Using normal Pig constructs
• grunt> macro_max_by_group_A_0 = GROUP records by (year);
• grunt> max_temp = FOREACH macro_max_by_group_A_0 GENERATE group, MAX(records.(temperature));
• grunt> DUMP max_temp;
 (1949,111)
 (1950,22)
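Macros can also be kept in a separate file and pulled in with IMPORT; a sketch, assuming the definition above has been saved locally in a file called max_temp.macro:
• grunt> IMPORT 'max_temp.macro';
• grunt> max_temp = max_by_group(records, year, temperature);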
  • 40. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
UDF
UDFs are User-Defined Functions
They give you the ability to plug in custom code
We cover only Java UDFs in this section
UDFs can also be written in Python and JavaScript
Types of UDF:
• Filter UDF
• Eval UDF
• Load UDF
  • 41. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
Filter UDF
Filter UDFs are UDFs used as filters in Pig
The idea is to change this line
• filtered_records = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
to
• filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);
Filter UDFs are all subclasses of FilterFunc, which itself is a subclass of EvalFunc. EvalFunc looks like the following class:
• public abstract class EvalFunc<T> {
    public abstract T exec(Tuple input) throws IOException;
  }
  • 42. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482 Filter UDF EvalFunc’s only abstract method, exec(), takes a tuple and returns a single value, the (parameterized) type T. For FilterFunc, T is Boolean, so the method should return true only for those tuples that should not be filtered out.
  • 43. package org.vmware.training;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.FilterFunc;
import org.apache.pig.FuncSpec;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;
  • 44. public class IsGoodQuality extends FilterFunc {
  @Override
  public Boolean exec(Tuple tuple) throws IOException {
    // A missing or empty tuple cannot be a good reading
    if (tuple == null || tuple.size() == 0) {
      return false;
    }
    try {
      Object object = tuple.get(0);
      if (object == null) {
        return false;
      }
      // Keep only the quality codes that indicate a reliable reading
      int i = (Integer) object;
      return i == 0 || i == 1 || i == 4 || i == 5 || i == 9;
    } catch (ExecException e) {
      throw new IOException(e);
    }
  }
}
  • 45. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
Let's write out the code
$ mkdir ~/UDF
$ mkdir ~/UDF/IsGoodQuality
$ mkdir ~/UDF/IsGoodQuality/IsGoodQualityClasses
$ cd ~/UDF/IsGoodQuality
$ vim IsGoodQuality.java
Copy the code from the previous slide into it
  • 46. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
Let's compile the code and create a jar
Set the classpath
• $ export CLASSPATH=/usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*:/usr/lib/pig/*:/usr/lib/pig/lib/*
Compile the Java class
• $ javac -d IsGoodQualityClasses IsGoodQuality.java
Create a jar
• $ jar -cvf goodquality.jar -C IsGoodQualityClasses/ .
  • 47. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
Let's use the UDF in our code
Start the Pig console
• $ cd
• $ pig
Register the jar file
• grunt> REGISTER /root/UDF/IsGoodQuality/goodquality.jar;
Use the UDF from the jar
• grunt> records = LOAD 'sample.txt' AS (year:chararray, temperature:int, quality:int);
• grunt> filtered_records = FILTER records BY temperature != 9999 AND org.vmware.training.IsGoodQuality(quality);
• grunt> DUMP filtered_records;
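To avoid typing the fully qualified class name every time, the UDF can also be given a short alias with DEFINE (the alias isGood below is just a convenient name):
• grunt> DEFINE isGood org.vmware.training.IsGoodQuality();
• grunt> filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);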
  • 48. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
Join
Prerequisite
• $ hadoop fs -copyFromLocal /root/input/pig/join/A .
• $ hadoop fs -copyFromLocal /root/input/pig/join/B .
• $ pig
• grunt> A = LOAD 'A' AS (id:int, item:chararray);
• grunt> B = LOAD 'B' AS (name:chararray, id:int);
• grunt> DUMP A;
 (2,Tie)
 (4,Coat)
 (3,Hat)
 (1,Scarf)
• grunt> DUMP B;
 (Joe,2)
 (Hank,4)
 (Ali,0)
 (Eve,3)
 (Hank,2)
  • 49. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
Join
Inner join
• grunt> C = JOIN A BY $0, B BY $1;
• grunt> DUMP C;
 (2,Tie,Joe,2)
 (2,Tie,Hank,2)
 (3,Hat,Eve,3)
 (4,Coat,Hank,4)
Outer join
• grunt> C = JOIN A BY $0 LEFT OUTER, B BY $1;
• grunt> DUMP C;
 (1,Scarf,,)
 (2,Tie,Joe,2)
 (2,Tie,Hank,2)
 (3,Hat,Eve,3)
 (4,Coat,Hank,4)
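Pig also supports RIGHT and FULL outer joins; a sketch of a full outer join on the same relations (the ordering of the output rows may differ when you run it):
• grunt> C = JOIN A BY $0 FULL OUTER, B BY $1;
• grunt> DUMP C;
 (,,Ali,0)
 (1,Scarf,,)
 (2,Tie,Joe,2)
 (2,Tie,Hank,2)
 (3,Hat,Eve,3)
 (4,Coat,Hank,4)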
  • 50. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
COGROUP
Similar to JOIN
Creates a nested set of output tuples rather than a flat structure
• grunt> D = COGROUP A BY $0, B BY $1;
• grunt> DUMP D;
 (0,{},{(Ali,0)})
 (1,{(1,Scarf)},{})
 (2,{(2,Tie)},{(Joe,2),(Hank,2)})
 (3,{(3,Hat)},{(Eve,3)})
 (4,{(4,Coat)},{(Hank,4)})
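The empty bags can then be filtered out with the built-in IsEmpty function; a minimal sketch on the relation above:
• grunt> E = FILTER D BY NOT IsEmpty(A);
• grunt> DUMP E;
 (1,{(1,Scarf)},{})
 (2,{(2,Tie)},{(Joe,2),(Hank,2)})
 (3,{(3,Hat)},{(Eve,3)})
 (4,{(4,Coat)},{(Hank,4)})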
  • 51. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
CROSS
Joins every tuple in a relation with every tuple in a second relation
• grunt> I = CROSS A, B;
• grunt> DUMP I;
 (2,Tie,Joe,2)
 (2,Tie,Hank,4)
 (2,Tie,Ali,0)
 (2,Tie,Eve,3)
 (2,Tie,Hank,2)
 (4,Coat,Joe,2)
 (4,Coat,Hank,4)
 …
  • 52. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
STREAM
The STREAM operator allows you to transform data in a relation using an external program or script
It is named by analogy with Hadoop Streaming, which provides a similar capability for MapReduce
• grunt> DUMP A;
 (2,Tie)
 (4,Coat)
 (3,Hat)
 (1,Scarf)
• grunt> C = STREAM A THROUGH `cut -f 2`;
• grunt> DUMP C;
 (Tie)
 (Coat)
 (Hat)
 (Scarf)
  • 53. Clogeny Technologies http://www.clogeny.com (US) 408-556-9645 (India) +91 20 661 43 482
STREAM
The STREAM operator uses PigStorage to serialize and deserialize relations to and from the program’s standard input and output streams
Tuples in A are converted to tab-delimited lines that are passed to the command
The output of the command is read one line at a time and split on tabs to create new tuples for the output relation C
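A custom script can also be used as the stream command; this sketch assumes a local script called is_good.py that reads tab-delimited lines from standard input and writes out only the lines it wants to keep (the script name and its behaviour are illustrative):
• grunt> DEFINE is_good `is_good.py` SHIP ('is_good.py');
• grunt> C = STREAM records THROUGH is_good AS (year:chararray, temperature:int, quality:int);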