Apache Pig is a platform for analyzing large datasets that runs on Hadoop. It provides a high-level language called Pig Latin that allows users to write data analysis programs without having to write complex MapReduce code in Java; Pig Latin scripts are compiled into MapReduce jobs. Pig offers built-in operators for joins, filters, and ordering, and it can handle both structured and unstructured data.
2. WHAT IS PIG?
• Apache Pig is a high-level platform for creating programs that run on Apache Hadoop.
• It is a tool/platform used to analyze large data sets by representing them as data flows.
• Pig generates and compiles Map/Reduce programs on the fly.
• To write data analysis programs, Pig provides a high-level language known as Pig Latin.
• Apache Pig has a component known as the Pig Engine that accepts Pig Latin scripts as input and converts those scripts into MapReduce jobs.
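As a minimal, purely illustrative sketch (the file name and fields are assumptions), a complete Pig Latin script can be as short as two lines; the Pig Engine turns it into MapReduce jobs automatically:
logs = LOAD '/data/weblogs.txt' AS (url:chararray, hits:int);
DUMP logs;  -- converted into a MapReduce job at execution time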
3. WHY PIG?
• Using Pig Latin, programmers can perform MapReduce tasks easily without having to write complex Java code.
• Instead of writing roughly 200 lines of code in Java, we can often write about 10 lines in Apache Pig for the same operation (see the sketch after this list).
• Pig Latin is an SQL-like language, so it is easy to learn Apache Pig when you are already familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc.
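As a sketch of that brevity (the file name and fields are assumptions for illustration), the following few lines of Pig Latin filter and sort a dataset using built-in operators, a task that would take far more code as a hand-written Java MapReduce job:
sales = LOAD '/data/sales.txt' AS (product:chararray, price:int);
expensive = FILTER sales BY price > 100;  -- built-in filter operator
sorted = ORDER expensive BY price DESC;   -- built-in ordering operator
DUMP sorted;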
4. FEATURES OF PIG
• Rich set of operators - join, sort, filter, etc.
• Ease of programming - similar to SQL.
• Optimization opportunities - tasks are optimized automatically.
• Extensibility - users can develop their own functions to read, process, and write data (see the sketch after this list).
• Handles all kinds of data - analyzes both structured and unstructured data.
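As an illustration of extensibility, a user-defined function (UDF) written in Java can be registered and called from Pig Latin. The jar name, file name, and class below are purely hypothetical:
REGISTER myudfs.jar;  -- jar containing the custom UDF (hypothetical)
names = LOAD '/data/names.txt' AS (name:chararray);
upper = FOREACH names GENERATE com.example.udf.ToUpper(name);  -- invoke the UDF by its class name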
5. Apache Pig vs MapReduce
• Apache Pig is a data flow language; MapReduce is a data processing paradigm.
• Pig Latin is a high-level language; MapReduce is low level and rigid.
• Performing a Join operation in Apache Pig is pretty simple (see the sketch after this list); it is quite difficult in MapReduce to perform a Join operation between datasets.
• Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig; exposure to Java is a must to work with MapReduce.
• Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent; MapReduce will require almost 20 times more lines to perform the same task.
• There is no need for compilation: on execution, every Apache Pig operator is converted internally into a MapReduce job, whereas MapReduce jobs have a long compilation process.
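To make the join comparison concrete, here is a hedged sketch in Pig Latin (the file names, fields, and default tab-separated format are assumptions for illustration); the join itself is a single operator:
customers = LOAD '/data/customers.txt' AS (id:int, name:chararray);
orders = LOAD '/data/orders.txt' AS (order_id:int, customer_id:int);
joined = JOIN customers BY id, orders BY customer_id;  -- one line replaces a hand-written MapReduce join
DUMP joined;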
6. Apache Pig vs SQL
• Pig Latin is a procedural language; SQL is a declarative language.
• In Apache Pig, schema is optional: we can store data without designing a schema, and values are then referenced by position as $0, $1, etc. (see the sketch after this list). In SQL, schema is mandatory.
• The data model in Apache Pig is nested relational; the data model used in SQL is flat relational.
• Apache Pig provides limited opportunity for query optimization; there is more opportunity for query optimization in SQL.
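As a sketch of the optional schema (the file name is an assumption), data can be loaded with no schema at all and its fields referenced purely by position:
raw = LOAD '/data/sales.csv' USING PigStorage(',');  -- no schema declared
firstTwo = FOREACH raw GENERATE $0, $1;              -- fields referenced by position
DUMP firstTwo;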
7. Apache Pig vs Hive
• Apache Pig uses a language called Pig Latin and was originally created at Yahoo; Hive uses a language called HiveQL and was originally created at Facebook.
• Pig Latin is a data flow language; HiveQL is a query processing language.
• Pig Latin is a procedural language and fits the pipeline paradigm; HiveQL is a declarative language.
• Apache Pig can handle structured, unstructured, and semi-structured data; Hive is mostly for structured data.
8. Applications of Apache Pig
Apache Pig is generally used by data scientists for performing tasks
involving ad-hoc processing and quick prototyping.
Apache Pig is used:
• To process huge data sources such as web logs.
• To perform data processing for search platforms.
• To process time-sensitive data loads.
9. CREATE YOUR FIRST PIG PROGRAM
Problem Statement:
Find out Number of Products Sold in Each Country.
Input: Our input data set is a CSV file, SalesJan2009.csv
10. PREREQUISITES:
This tutorial was developed on the Linux (Ubuntu) operating system.
You should have Hadoop (version 2.2.0 is used in this tutorial) already
installed and running on the system.
You should have Java (version 1.8.0 used for this tutorial) already
installed on the system.
You should have set JAVA_HOME accordingly.
This guide is divided into 2 parts
Pig Installation
Pig Demo
11. PART 1) PIG INSTALLATION
Change to user 'hduser' (the user used for Hadoop configuration).
Step 1) Download the latest stable release of Pig (version 0.12.1 is used
in this tutorial) from one of the mirror sites listed at
http://pig.apache.org/releases.html
Select the tar.gz file (and not the src.tar.gz file) to download.
Step 2) Once the download is complete, navigate to the directory
containing the downloaded tar file and move the tar to the location
where you want to set up Pig. In this case, we will move it to /usr/local.
Move to the directory containing the Pig files:
cd /usr/local
Extract the contents of the tar file as below:
sudo tar -xvf pig-0.12.1.tar.gz
12. Step 3) Modify ~/.bashrc to add Pig-related environment variables.
Open the ~/.bashrc file in any text editor of your choice and make the
modifications below:
export PIG_HOME=<Installation directory of Pig>
export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH
Step 4) Now, source this environment configuration using the command
below:
. ~/.bashrc
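For example, assuming the tar from Step 2 extracted to /usr/local/pig-0.12.1, the two lines would read:
export PIG_HOME=/usr/local/pig-0.12.1
export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH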
13. Step 5) We need to recompile Pig to support Hadoop 2.2.0.
Here are the steps to do this:
Go to the Pig home directory
cd $PIG_HOME
Install Ant
sudo apt-get install ant
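Finally, run the recompilation itself. The commonly documented Ant invocation for building Pig 0.12 against Hadoop 2.x is shown below (treat the target and flag as an assumption to verify against your Pig version's build instructions); the build can take several minutes:
sudo ant clean jar-all -Dhadoopversion=23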
Step 6) Test the Pig installation using the command
pig -help
14. PART 2) PIG DEMO
Step 7) Start Hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
Step 8) Pig reads files from HDFS in MapReduce mode and stores the
results back to HDFS.
Copy the file SalesJan2009.csv (stored on the local file
system at ~/input/SalesJan2009.csv) to the HDFS (Hadoop Distributed File
System) home directory.
Here the file is in the folder 'input'; if the file is stored in some other
location, give that path instead.
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal ~/input/SalesJan2009.csv /
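Optionally, verify that the file arrived by listing the HDFS root directory:
$HADOOP_HOME/bin/hdfs dfs -ls /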
15. Step 9) Pig Configuration
First navigate to $PIG_HOME/conf
cd $PIG_HOME/conf
sudo cp pig.properties pig.properties.original
Open pig.properties using a text editor of your choice, and specify the
log file path using the pig.logfile property.
sudo gedit pig.properties
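For example, adding a line like the following (the path is an assumption; use any writable location) directs Pig's log output to a file:
pig.logfile=/usr/local/pig-0.12.1/pig.log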
16. Step 10) Run the command 'pig', which will start the Pig command prompt, an
interactive shell for Pig queries.
Step 11) In the Grunt command prompt for Pig, execute the Pig commands below in
order, pressing Enter after each one.
Load the data from the CSV file into a relation named salesTable:
salesTable = LOAD '/SalesJan2009.csv' USING PigStorage(',') AS
(Transaction_date:chararray,Product:chararray,Price:chararray,Payment_Type:chararray,Name:chararray,City:chararray,State:chararray,Country:chararray,Account_Created:chararray,Last_Login:chararray,Latitude:chararray,Longitude:chararray);
Group the data by the field Country:
GroupByCountry = GROUP salesTable BY Country;
For each tuple in 'GroupByCountry', generate the resulting string of the form -> Name of
Country : No. of products sold:
CountByCountry = FOREACH GroupByCountry GENERATE
CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
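Optionally, you can inspect the contents of any relation at this point with DUMP, which triggers execution and prints the results to the console:
DUMP CountByCountry;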
17. Store the results of the data flow in the directory 'pig_output_sales' on
HDFS:
STORE CountByCountry INTO 'pig_output_sales' USING
PigStorage('\t');
Step 12) The result can be seen through the command interface as:
$HADOOP_HOME/bin/hdfs dfs -cat pig_output_sales/part-r-00000
18. CONCLUSION
Pig enables people to focus more on analyzing bulk data sets and to
spend less time writing MapReduce programs.