3. In this session you will learn about:
Apache Pig vs MapReduce
Introduction to Pig
Twitter Case Study
Pig Architecture
Installing and Running Pig
User Defined Functions
Learning Objectives
4. As we mentioned in our Hadoop Ecosystem blog, Apache Pig is an essential part of our Hadoop
ecosystem.
Before starting with Apache Pig, let us answer an obvious question:
“While MapReduce was there for Big Data Analytics, why did Apache Pig come into the
picture?”
One big reason: approximately 10 lines of Pig code are equal to 200 lines of MapReduce code.
5. Programmers face difficulty writing MapReduce tasks as it requires Java or Python
programming knowledge. For them, Apache Pig is a savior.
Pig Latin is a high-level data flow language, whereas MapReduce is a low-level data
processing paradigm.
Without writing complex Java implementations in MapReduce, programmers can achieve
the same implementations very easily using Pig Latin.
Apache Pig uses a multi-query approach (i.e. using a single query of Pig Latin we can
accomplish multiple MapReduce tasks), which reduces the length of the code by 20 times.
Hence, this reduces the development period by almost 16 times.
Apache Pig vs MapReduce
6. Pig provides many built-in operators to support data operations like joins, filters, ordering,
sorting etc., whereas performing the same operations in MapReduce is a humongous task.
Performing a Join operation in Apache Pig is simple, whereas it is difficult in MapReduce
to perform a Join operation between data sets, as it requires multiple MapReduce tasks
to be executed sequentially to fulfill the job.
In addition, it also provides nested data types like tuples, bags, and maps that are missing
from MapReduce.
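As a sketch of how simple a Join is in Pig Latin (the file names and schemas below are hypothetical, for illustration only):

```pig
-- Hypothetical comma-separated input files.
customers = LOAD 'customers.txt' USING PigStorage(',')
            AS (id:int, name:chararray);
orders    = LOAD 'orders.txt' USING PigStorage(',')
            AS (order_id:int, cust_id:int, amount:double);

-- A join that would need a full MapReduce program is a single statement here.
joined = JOIN customers BY id, orders BY cust_id;
DUMP joined;
```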
7. Apache Pig is a platform used to analyze large data sets by representing them as data flows. It
is designed to provide an abstraction over MapReduce, reducing the complexities of writing
a MapReduce program. We can perform data manipulation operations very easily in Hadoop
using Apache Pig.
The features of Apache Pig are:
Pig enables programmers to write complex data transformations without knowing Java.
Apache Pig has two main components – the Pig Latin language and the Pig Run-time
Environment, in which Pig Latin programs are executed.
For Big Data Analytics, Pig gives a simple data flow language known as Pig Latin which
has functionalities similar to SQL like join, filter, limit etc.
Introduction to Apache Pig
8. Developers who work with scripting languages and SQL can leverage Pig Latin, which
gives them ease of programming with Apache Pig. Pig Latin provides various built-in
operators like join, sort, filter, etc. to read, write, and process large data sets. Thus it is
evident that Pig has a rich set of operators.
Programmers write scripts using Pig Latin to analyze data, and these scripts are internally
converted to Map and Reduce tasks by the Pig MapReduce engine. Before Pig, writing
MapReduce tasks was the only way to process the data stored in HDFS.
If a programmer wants to write custom functions which are unavailable in Pig, Pig allows
them to write User Defined Functions (UDFs) in a language of their choice like Java,
Python, Ruby, Jython, JRuby etc. and embed them in the Pig script. This
provides extensibility to Apache Pig.
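For example, a hypothetical Java UDF could be registered and used from a Pig script like this (the jar name and class name are made up for illustration):

```pig
-- Assumes myudfs.jar contains a Java UDF class com.example.UPPER
-- (both names are hypothetical).
REGISTER myudfs.jar;
DEFINE UPPER com.example.UPPER();

names = LOAD 'names.txt' AS (name:chararray);
upper_names = FOREACH names GENERATE UPPER(name);
DUMP upper_names;
```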
9. Pig can process any kind of data, i.e. structured, semi-structured or unstructured data,
coming from various sources. Apache Pig handles all kinds of data.
Approximately, 10 lines of Pig code are equal to 200 lines of MapReduce code.
It can handle inconsistent schema (in case of unstructured data).
Apache Pig extracts the data, performs operations on that data and dumps the data in the
required format in HDFS i.e. ETL (Extract Transform Load).
Apache Pig automatically optimizes the tasks before execution, i.e. automatic
optimization.
It allows programmers and developers to concentrate on the whole operation instead
of creating mapper and reducer functions separately.
10. Apache Pig is used for analyzing and performing tasks involving ad-hoc processing.
Apache Pig is used:
Where we need to process huge data sets like web logs, streaming online data, etc.
Where we need data processing for search platforms (different types of data need to be
processed); for example, Yahoo uses Pig for 40% of its jobs, including news feeds and its search
engine.
Where we need to process time-sensitive data loads. Here, data needs to be extracted and
analyzed quickly. E.g. machine learning algorithms require time-sensitive data loads: Twitter
needs to quickly extract data about customer activities (i.e. tweets, re-tweets and likes),
analyze the data to find patterns in customer behavior, and make recommendations
immediately, like trending tweets.
Where to use Apache Pig?
11. Let us take the case study of Twitter, which adopted Apache Pig.
Twitter’s data was growing at an accelerating rate (i.e. 10 TB data/day). Thus, Twitter decided
to move the archived data to HDFS and adopt Hadoop for extracting the business values out
of it. Their major aim was to analyse data stored in Hadoop to come up with the following
insights on a daily, weekly or monthly basis.
Counting operations:
How many requests does Twitter serve in a day?
What is the average latency of the requests?
How many searches happen each day on Twitter?
How many unique queries are received?
How many unique users come to visit?
What is the geographic distribution of the users?
Apache Pig Tutorial: Twitter Case Study
12. Correlating Big Data:
How does usage differ for mobile users?
Cohort analysis: analyzing data by categorizing users based on their behavior.
What goes wrong when a site problem occurs?
Which features do users often use?
Search correction and search suggestions.
Research on Big Data & produce better outcomes like:
What can Twitter analyze about users from their tweets?
Who follows whom and on what basis?
What is the ratio of followers to followings?
What is the reputation of the user?
13. So, Twitter moved to Apache Pig for analysis.
Now, joining data sets, grouping them, sorting them and retrieving data become
easier and simpler.
You can see in the below image how Twitter used Apache Pig to analyse its large
data set.
14. Twitter had both semi-structured data like Twitter Apache logs, Twitter search logs, Twitter
MySQL query logs, application logs and structured data like tweets, users, block
notifications, phones, favorites, saved searches, re-tweets, authentications, SMS usage,
user followings, etc. which can be easily processed by Apache Pig.
Twitter dumps all its archived data on HDFS.
It has two tables i.e. user data and tweets data.
User data contains information about the users like username, followers, followings, number
of tweets etc.
While tweet data contains the tweet, its owner, number of re-tweets, number of likes etc. Now,
Twitter uses this data to analyse its customers’ behavior and improve their
experience.
15. Problem: analyze how many tweets are stored per user in the given tweet tables.
16. STEP 1– First of all, Twitter imports the tables (i.e. the user table and the tweet table) into HDFS.
STEP 2– Then Apache Pig loads (LOAD) the tables into Apache Pig framework.
STEP 3– Then it joins and groups the tweet table and the user table using the COGROUP command as shown
in the above image.
This results in the inner Bag Data type, which we will discuss later in this blog.
Example of Inner bags produced (refer to the above image) –
(1,{(1,Jay,xyz),(1,Jay,pqr),(1,Jay,lmn)})
(2,{(2,Ellie,abc),(2,Ellie,vxy)})
(3, {(3,Sam,stu)})
STEP 4– Then the tweets are counted according to the users using the COUNT command, so that the total
number of tweets per user can be easily calculated.
Example of tuple produced as (id, tweet count) (refer to the above image) –
(1, 3)
(2, 2)
(3, 1)
STEP 5– At last, the result is joined with the user table to extract the user name along with the produced result.
Example of tuple produced as (id, name, tweet count) (refer to the above image) –
(1, Jay, 3)
(2, Ellie, 2)
(3, Sam, 1)
STEP 6– Finally, this result is stored back in the HDFS.
The step by step solution of this problem is shown in the above image.
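The steps above can be sketched in Pig Latin roughly as follows (the paths and schemas are hypothetical simplifications of Twitter's actual tables):

```pig
-- STEP 2: load the tables (hypothetical paths and simplified schemas)
users  = LOAD '/data/user_table'  AS (id:int, name:chararray);
tweets = LOAD '/data/tweet_table' AS (user_id:int, tweet:chararray);

-- STEP 3: group the tweets by user (this produces inner bags)
grouped = COGROUP tweets BY user_id;

-- STEP 4: count the tweets per user, giving (id, tweet count)
counts = FOREACH grouped GENERATE group AS id, COUNT(tweets) AS tweet_count;

-- STEP 5: join with the user table to attach the user name
result = JOIN counts BY id, users BY id;
named  = FOREACH result GENERATE counts::id, users::name, counts::tweet_count;

-- STEP 6: store the result back in HDFS
STORE named INTO '/data/tweet_counts';
```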
17. For writing a Pig script, we need Pig Latin language and to execute them, we need an execution
environment. The architecture of Apache Pig is shown in the below image.
Architecture
Grunt Shell: this is Pig’s interactive shell, provided to execute all Pig scripts.
Pig Script: write all the Pig commands in a script file and execute the Pig script file; this is
executed by the Pig Server.
Parser: Pig scripts are passed to the Parser, which does type checking and checks the syntax of
the script. The output of the Parser is a DAG (directed acyclic graph) of Pig Latin statements.
Optimizer: the DAG is submitted to the Optimizer, which performs optimization activities
like split, merge, transform, and reorder operators etc. The Optimizer provides the
automatic optimization feature to Apache Pig.
Compiler: the Compiler is responsible for converting the optimized Pig jobs automatically into
MapReduce jobs.
18. Apache Pig Installation on Linux:
Below are the steps for Apache Pig installation on Linux (Ubuntu/CentOS/Windows using a
Linux VM). I am using Ubuntu 16.04 in the below setup.
Step 1: Download Pig tar file.
Command: wget http://www-us.apache.org/dist/pig/pig-0.16.0/pig-0.16.0.tar.gz
19. Step 2: Extract the tar file using tar command.
In the below tar command, x means extract the archive file, z means filter the archive through gzip, and
f means the filename of the archive file.
Command: tar -xzf pig-0.16.0.tar.gz
Command: ls
20. Step 3: Edit the “.bashrc” file to update the environment variables of Apache Pig.
We are setting this so that we can access Pig from any directory; we need not go to the Pig directory to
execute Pig commands. Also, if any other application is looking for Pig, it will get to know the path of
Apache Pig from this file.
Command: sudo gedit .bashrc
Add the following at the end of the file:
# Set PIG_HOME
export PIG_HOME=/home/edureka/pig-0.16.0
export PATH=$PATH:/home/edureka/pig-0.16.0/bin
export PIG_CLASSPATH=$HADOOP_CONF_DIR
Also, make sure that the Hadoop path is set.
Run the below command to apply the changes in the
same terminal.
Command: source .bashrc
21. Step 4: Check the Pig version.
Command: pig -version
Step 5: Check Pig help to see all the Pig command options.
Command: pig -help
22. Step 6: Run Pig to start the grunt shell. Grunt shell is used to run Pig Latin scripts.
Command: pig
If you look at the above image carefully, Apache Pig has two modes in which it can run; by
default it chooses MapReduce mode. The other mode in which you can run Pig is local
mode.
23. MapReduce Mode – This is the default mode, which requires access to a Hadoop cluster and an HDFS
installation. Since this is the default mode, it is not necessary to specify the -x flag (you can
execute pig OR pig -x mapreduce). The input and output in this mode are present on HDFS.
Local Mode – With access to a single machine, all files are installed and run using the local host and
file system. Here, local mode is specified using the -x flag (pig -x local). The input and output in this
mode are present on the local file system.
Command: pig -x local
Execution modes in Apache Pig
24. Pig Latin is the language used to analyze data in Hadoop using Apache Pig. In this chapter, we are
going to discuss the basics of Pig Latin such as Pig Latin statements, data types, general and relational
operators, and Pig Latin UDFs.
Pig Latin – Data Model
As discussed in the previous chapters, the data model of Pig is fully nested. A Relation is the
outermost structure of the Pig Latin data model. And it is a bag where −
A bag is a collection of tuples.
A tuple is an ordered set of fields.
A field is a piece of data.
Pig Latin
25. Pig Latin – Statements
While processing data using Pig Latin, statements are the basic constructs.
These statements work with relations. They include expressions and schemas.
Every statement ends with a semicolon (;).
We will perform various operations using operators provided by Pig Latin, through statements.
Apart from LOAD and STORE, all other Pig Latin statements take a relation
as input and produce another relation as output.
As soon as you enter a LOAD statement in the Grunt shell, its semantic checking will be carried out. To
see the contents of the relation, you need to use the DUMP operator. Only after performing
the DUMP operation will the MapReduce job for loading the data be carried out.
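For instance, a LOAD followed by a DUMP looks like this (the file path and schema are hypothetical); nothing runs on the cluster until the DUMP statement is entered:

```pig
-- Hypothetical tab-separated student file in HDFS.
student = LOAD '/pig_data/student_data.txt' USING PigStorage('\t')
          AS (id:int, name:chararray, city:chararray);

-- Semantic checking happens at the LOAD statement; the MapReduce job
-- is triggered only by the DUMP below.
DUMP student;
```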
26. Pig Latin provides the following data types:
1. int – Represents a signed 32-bit integer. Example: 8
2. long – Represents a signed 64-bit integer. Example: 5L
3. float – Represents a signed 32-bit floating point. Example: 5.5F
4. double – Represents a 64-bit floating point. Example: 10.5
5. chararray – Represents a character array (string) in Unicode UTF-8 format. Example: ‘tutorials point’
6. bytearray – Represents a byte array (blob).
7. boolean – Represents a Boolean value. Example: true/false
8. datetime – Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
9. biginteger – Represents a Java BigInteger. Example: 60708090709
Pig Latin – Data types
27. Pig Latin provides the following type construction operators:
() – Tuple constructor operator: this operator is used to construct a tuple.
Example: (Raju, 30)
{} – Bag constructor operator: this operator is used to construct a bag.
Example: {(Raju, 30), (Mohammad, 45)}
[] – Map constructor operator: this operator is used to construct a map.
Example: [name#Raja, age#30]
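A minimal sketch of these constructors used inside a FOREACH statement (the input file is hypothetical and the constant values are for illustration only):

```pig
-- Hypothetical single-column input file.
A = LOAD 'data.txt' AS (f1:int);
B = FOREACH A GENERATE
        (1, 'Raju')              AS t,   -- tuple constructor
        {('Raju'), ('Mohammad')} AS bg,  -- bag constructor
        ['name'#'Raja']          AS m;   -- map constructor
DUMP B;
```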
Pig Latin – Type Construction Operators
28. Pig Latin – Relational Operations
Operator Description
Loading and Storing
LOAD To Load the data from the file system (local/HDFS) into a relation.
STORE To save a relation to the file system (local/HDFS).
Filtering
FILTER To remove unwanted rows from a relation.
DISTINCT To remove duplicate rows from a relation.
FOREACH, GENERATE To generate data transformations based on columns of data.
STREAM To transform a relation using an external program.
Grouping and Joining
JOIN To join two or more relations.
COGROUP To group the data in two or more relations.
GROUP To group the data in a single relation.
CROSS To create the cross product of two or more relations.
29. Sorting
ORDER To arrange a relation in a sorted order based on one or
more fields (ascending or descending).
LIMIT To get a limited number of tuples from a relation.
Combining and Splitting
UNION To combine two or more relations into a single relation.
SPLIT To split a single relation into two or more relations.
Diagnostic Operators
DUMP To print the contents of a relation on the console.
DESCRIBE To describe the schema of a relation.
EXPLAIN To view the logical, physical, or MapReduce execution
plans to compute a relation.
ILLUSTRATE To view the step-by-step execution of a series of
statements.
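To tie several of these operators together, here is a short end-to-end sketch (the employees file, its schema, and the column names are assumptions, not part of the original material):

```pig
-- Hypothetical comma-separated employee file.
emp = LOAD 'employees.txt' USING PigStorage(',')
      AS (id:int, name:chararray, age:int, city:chararray);

adults  = FILTER emp BY age >= 18;                    -- Filtering
by_city = GROUP adults BY city;                       -- Grouping
counts  = FOREACH by_city GENERATE group AS city,
              COUNT(adults) AS n;                     -- Transformation
sorted  = ORDER counts BY n DESC;                     -- Sorting
top3    = LIMIT sorted 3;                             -- Limiting
DUMP top3;                                            -- Diagnostic output
```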