Apache Pig is a platform for analyzing large datasets that runs on Hadoop. It provides a high-level language called Pig Latin that allows users to write data analysis programs without having to write complex MapReduce code in Java; Pig Latin scripts are compiled into MapReduce jobs. Pig offers built-in operators for joins, filters, and ordering, and it can handle both structured and unstructured data.
2. WHAT IS PIG?
• Apache Pig is a high-level platform for creating programs that run on Apache Hadoop.
• It is a tool/platform used to analyze large data sets by representing them as data flows.
• Pig generates and compiles Map/Reduce programs on the fly.
• To write data analysis programs, Pig provides a high-level language known as Pig Latin.
• Apache Pig has a component known as the Pig Engine that accepts Pig Latin scripts as input and converts those scripts into MapReduce jobs.
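As a minimal, purely illustrative sketch (the file name and fields are assumptions), a complete Pig Latin script can be as short as two lines; the Pig Engine turns it into MapReduce jobs automatically:
logs = LOAD '/data/weblogs.txt' AS (url:chararray, hits:int);
DUMP logs;  -- converted into a MapReduce job at execution time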
3. WHY PIG?
• Using Pig Latin, programmers can perform MapReduce tasks easily without having to write complex Java code.
• Instead of writing roughly 200 lines of code in Java, we can often write about 10 lines in Apache Pig for the same operation (see the sketch after this list).
• Pig Latin is an SQL-like language, so it is easy to learn Apache Pig when you are already familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc.
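As a sketch of that brevity (the file name and fields are assumptions for illustration), the following few lines of Pig Latin filter and sort a dataset using built-in operators, a task that would take far more code as a hand-written Java MapReduce job:
sales = LOAD '/data/sales.txt' AS (product:chararray, price:int);
expensive = FILTER sales BY price > 100;  -- built-in filter operator
sorted = ORDER expensive BY price DESC;   -- built-in ordering operator
DUMP sorted;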
4. FEATURES OF PIG
• Rich set of operators - join, sort, filter, etc.
• Ease of programming - similar to SQL.
• Optimization opportunities - tasks are optimized automatically.
• Extensibility - users can develop their own functions to read, process, and write data (see the sketch after this list).
• Handles all kinds of data - analyzes both structured and unstructured data.
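As an illustration of extensibility, a user-defined function (UDF) written in Java can be registered and called from Pig Latin. The jar name, file name, and class below are purely hypothetical:
REGISTER myudfs.jar;  -- jar containing the custom UDF (hypothetical)
names = LOAD '/data/names.txt' AS (name:chararray);
upper = FOREACH names GENERATE com.example.udf.ToUpper(name);  -- invoke the UDF by its class name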
5. Apache Pig vs MapReduce
• Apache Pig is a data flow language; MapReduce is a data processing paradigm.
• Pig Latin is a high-level language; MapReduce is low level and rigid.
• Performing a Join operation in Apache Pig is pretty simple (see the sketch after this list); it is quite difficult in MapReduce to perform a Join operation between datasets.
• Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig; exposure to Java is a must to work with MapReduce.
• Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent; MapReduce will require almost 20 times more lines to perform the same task.
• There is no need for compilation: on execution, every Apache Pig operator is converted internally into a MapReduce job, whereas MapReduce jobs have a long compilation process.
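To make the join comparison concrete, here is a hedged sketch in Pig Latin (the file names, fields, and default tab-separated format are assumptions for illustration); the join itself is a single operator:
customers = LOAD '/data/customers.txt' AS (id:int, name:chararray);
orders = LOAD '/data/orders.txt' AS (order_id:int, customer_id:int);
joined = JOIN customers BY id, orders BY customer_id;  -- one line replaces a hand-written MapReduce join
DUMP joined;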
6. Apache Pig vs SQL
• Pig Latin is a procedural language; SQL is a declarative language.
• In Apache Pig, schema is optional: we can store data without designing a schema, and values are then referenced by position as $0, $1, etc. (see the sketch after this list). In SQL, schema is mandatory.
• The data model in Apache Pig is nested relational; the data model used in SQL is flat relational.
• Apache Pig provides limited opportunity for query optimization; there is more opportunity for query optimization in SQL.
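As a sketch of the optional schema (the file name is an assumption), data can be loaded with no schema at all and its fields referenced purely by position:
raw = LOAD '/data/sales.csv' USING PigStorage(',');  -- no schema declared
firstTwo = FOREACH raw GENERATE $0, $1;              -- fields referenced by position
DUMP firstTwo;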
7. Apache Pig vs Hive
• Apache Pig uses a language called Pig Latin and was originally created at Yahoo; Hive uses a language called HiveQL and was originally created at Facebook.
• Pig Latin is a data flow language; HiveQL is a query processing language.
• Pig Latin is a procedural language and fits the pipeline paradigm; HiveQL is a declarative language.
• Apache Pig can handle structured, unstructured, and semi-structured data; Hive is mostly for structured data.
8. Applications of Apache Pig
Apache Pig is generally used by data scientists for performing tasks
involving ad-hoc processing and quick prototyping.
Apache Pig is used:
• To process huge data sources such as web logs.
• To perform data processing for search platforms.
• To process time-sensitive data loads.
9. CREATE YOUR FIRST PIG PROGRAM
Problem Statement:
Find out Number of Products Sold in Each Country.
Input: Our input data set is a CSV file, SalesJan2009.csv
10. PREREQUISITES:
This tutorial was developed on the Linux (Ubuntu) operating system.
You should have Hadoop (version 2.2.0 is used in this tutorial) already
installed and running on the system.
You should have Java (version 1.8.0 used for this tutorial) already
installed on the system.
You should have set JAVA_HOME accordingly.
This guide is divided into 2 parts
Pig Installation
Pig Demo
11. PART 1) PIG INSTALLATION
Change to user 'hduser' (the user used for Hadoop configuration).
Step 1) Download the latest stable release of Pig (version 0.12.1 is used
in this tutorial) from one of the mirror sites listed at
http://pig.apache.org/releases.html
Select the tar.gz file (and not the src.tar.gz file) to download.
Step 2) Once the download is complete, navigate to the directory
containing the downloaded tar file and move the tar to the location
where you want to set up Pig. In this case, we will move it to /usr/local.
Move to the directory containing the Pig files:
cd /usr/local
Extract the contents of the tar file as below:
sudo tar -xvf pig-0.12.1.tar.gz
12. Step 3) Modify ~/.bashrc to add Pig-related environment variables.
Open the ~/.bashrc file in any text editor of your choice and make the
modifications below:
export PIG_HOME=<Installation directory of Pig>
export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH
Step 4) Now, source this environment configuration using the command
below:
. ~/.bashrc
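For example, assuming the tar from Step 2 extracted to /usr/local/pig-0.12.1, the two lines would read:
export PIG_HOME=/usr/local/pig-0.12.1
export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH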
13. Step 5) We need to recompile Pig to support Hadoop 2.2.0.
Here are the steps to do this:
Go to the Pig home directory
cd $PIG_HOME
Install Ant
sudo apt-get install ant
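Finally, run the recompilation itself. The commonly documented Ant invocation for building Pig 0.12 against Hadoop 2.x is shown below (treat the target and flag as an assumption to verify against your Pig version's build instructions); the build can take several minutes:
sudo ant clean jar-all -Dhadoopversion=23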
Step 6) Test the Pig installation using the command
pig -help
14. PART 2) PIG DEMO
Step 7) Start Hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
Step 8) Pig reads files from HDFS in MapReduce mode and stores the
results back to HDFS.
Copy the file SalesJan2009.csv (stored on the local file
system at ~/input/SalesJan2009.csv) to the HDFS (Hadoop Distributed File
System) home directory.
Here the file is in the folder 'input'; if the file is stored in some other
location, give that path instead.
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal ~/input/SalesJan2009.csv /
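Optionally, verify that the file arrived by listing the HDFS root directory:
$HADOOP_HOME/bin/hdfs dfs -ls /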
15. Step 9) Pig Configuration
First navigate to $PIG_HOME/conf
cd $PIG_HOME/conf
sudo cp pig.properties pig.properties.original
Open pig.properties using a text editor of your choice, and specify the
log file path using the pig.logfile property.
sudo gedit pig.properties
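For example, adding a line like the following (the path is an assumption; use any writable location) directs Pig's log output to a file:
pig.logfile=/usr/local/pig-0.12.1/pig.log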
16. Step 10) Run the command 'pig', which will start the Pig command prompt, an
interactive shell for Pig queries.
Step 11) In the Grunt command prompt for Pig, execute the Pig commands below in
order, pressing Enter after each one.
Load the data from the CSV file into a relation named salesTable:
salesTable = LOAD '/SalesJan2009.csv' USING PigStorage(',') AS
(Transaction_date:chararray,Product:chararray,Price:chararray,Payment_Type:chararray,Name:chararray,City:chararray,State:chararray,Country:chararray,Account_Created:chararray,Last_Login:chararray,Latitude:chararray,Longitude:chararray);
Group the data by the field Country:
GroupByCountry = GROUP salesTable BY Country;
For each tuple in 'GroupByCountry', generate the resulting string of the form -> Name of
Country : No. of products sold:
CountByCountry = FOREACH GroupByCountry GENERATE
CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
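Optionally, you can inspect the contents of any relation at this point with DUMP, which triggers execution and prints the results to the console:
DUMP CountByCountry;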
17. Store the results of the data flow in the directory 'pig_output_sales' on
HDFS:
STORE CountByCountry INTO 'pig_output_sales' USING
PigStorage('\t');
Step 12) The result can be seen through the command interface as:
$HADOOP_HOME/bin/hdfs dfs -cat pig_output_sales/part-r-00000
18. CONCLUSION
Pig enables people to focus more on analyzing bulk data sets and to
spend less time writing MapReduce programs.