APACHE PIG
By,
Anuja Gunale-Kasle
CONTENTS:
PIG background
PIG Architecture
PIG Latin Basics
PIG Execution Modes
PIG Processing: loading and transforming data
PIG Built-in functions
Filtering, grouping, sorting data
Installation of PIG and PIG Latin commands
Apache Pig was originally developed at Yahoo Research around 2006 for researchers to have an ad hoc way of creating and executing MapReduce jobs on very large data sets. In 2007, it was moved into the Apache Software Foundation.
 The story goes that the researchers working on the project initially referred to it simply as 'the language'. Eventually they needed to call it something.
 Off the top of his head, one researcher suggested Pig, and the name stuck.
 It is quirky yet memorable and easy to spell.
 While some have hinted that the name sounds coy or silly, it has provided us with an entertaining nomenclature, such as Pig Latin for the language, Grunt for the shell, and PiggyBank for the CPAN-like shared repository.
 Pig is a high-level data flow platform for executing MapReduce programs on Hadoop. It was developed by Yahoo. The language of Pig is Pig Latin.
 Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows.
 Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.
 Pig is an Apache open source project. This means users are free to download it as source or binary, use it for themselves, contribute to it, and—under the terms of the Apache License—use it in their products and change it as they see fit.
Apache Pig is a high-level data flow platform for executing MapReduce programs on Hadoop. The language used for Pig is Pig Latin.
Pig scripts are internally converted to MapReduce jobs and executed on data stored in HDFS. Apart from that, Pig can also execute its jobs on Apache Tez or Apache Spark.
Pig can handle any type of data, i.e., structured, semi-structured, or unstructured, and stores the corresponding results in the Hadoop Distributed File System. Every task that can be achieved using Pig can also be achieved using Java MapReduce.
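As a minimal sketch of such a data flow (the file name and fields below are hypothetical), a complete Pig Latin script might look like this:

-- load, filter, and store a comma-delimited file (hypothetical data)
emps = LOAD 'employees.csv' USING PigStorage(',') AS (id:int, name:chararray, salary:float);
rich = FILTER emps BY salary > 100000.0F;
STORE rich INTO 'high_earners';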
FEATURES OF APACHE PIG
Join Datasets
Sort Datasets
Filter
Data Types
Group By
User Defined Functions
ADVANTAGES OF APACHE PIG
•Less code - Pig needs fewer lines of code to perform an operation.
•Reusability - Pig code is flexible enough to be reused.
•Nested data types - Pig provides useful nested data types such as tuple, bag, and map.
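For illustration, a hypothetical record combining all three nested types - a tuple holding an atom, a bag of tuples, and a map - could be loaded with a schema like the one below (the field names are made up):

person = LOAD 'people.txt' AS (name:chararray, phones:{t:(phone:long)}, props:map[]);
-- a matching tuple: (Raja, {(9848022338),(9848022340)}, [city#Hyderabad])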
HIVE VS PIG VS SQL – WHEN TO USE WHAT?
When to Use Hive
 Facebook widely uses Apache Hive for analytical purposes. Furthermore, they usually promote the Hive language due to its extensive feature list and similarities with SQL. Here are some of the scenarios where Apache Hive is ideal:
• To query large datasets: Apache Hive is used especially for analytics on huge datasets. It is an easy way to approach and quickly carry out complex querying on datasets stored in the Hadoop ecosystem.
• For extensibility: Apache Hive contains a range of user APIs that help in building custom behaviour for the query engine.
• For someone familiar with SQL concepts: If you are familiar with SQL, Hive will be very easy to use, as you will see many similarities between the two. Hive uses clauses like select, where, order by, and group by, similar to SQL.
• To work on structured data: For structured data, Hive is widely adopted everywhere.
• To analyse historical data: Apache Hive is a great tool for analysing and querying historical data collected over a period of time.
When to Use Pig
 Apache Pig, developed at Yahoo Research in 2006, is known for its extensibility and optimization scope. It uses a multi-query approach that reduces the time spent scanning data. It usually runs on the client side of a Hadoop cluster, and it is quite easy to use if you are familiar with the SQL ecosystem. You can use Apache Pig in the following scenarios:
• To use as an ETL tool: Apache Pig is an excellent ETL (Extract-Transform-Load) tool for big data. It is a data flow system that uses Pig Latin, a simple language for data queries and manipulation.
• As a programmer with scripting knowledge: Programmers with scripting knowledge can learn to use Apache Pig easily and efficiently.
• For fast processing: Apache Pig can be faster than Hive because it uses a multi-query approach.
• When you don't want to work with a schema: With Apache Pig, there is no need to create a schema for data-loading work.
• For SQL-like functions: It has many functions similar to SQL, along with the COGROUP function.
When to Use SQL
 SQL is a general-purpose database management language used around the globe. It has been evolving to meet user expectations for decades. It is declarative and hence focuses explicitly on ‘what’ is needed.
 It is popularly used for transactional as well as analytical queries. When the requirements are not too demanding, SQL works as an excellent tool. Here are a few scenarios:
• For better performance: SQL is known for its ability to pull data quickly and frequently. It supports OLAP (Online Analytical Processing) applications and performs well for them; Hive, by contrast, is slow for online transactional needs.
• When the datasets are small: SQL works well with small datasets and performs much better for smaller amounts of data. It also has many ways to optimise queries.
• For frequent data manipulation: If your requirement involves frequent modification of records, or you need to update a large number of records frequently, SQL performs these activities well. SQL also provides an entirely interactive experience to the user.
APACHE PIG RUN MODES
 Apache Pig executes in two modes: Local Mode and MapReduce Mode.
Local Mode
• It executes in a single JVM and is used for development, experimentation, and prototyping.
• Here, files are installed and run using localhost.
• Local mode works on the local file system; the input and output data are stored in the local file system.
 The command to start the Grunt shell in local mode:
$ pig -x local
MapReduce Mode
• The MapReduce mode is also known as Hadoop Mode.
• It is the default mode.
• In this mode, Pig renders Pig Latin into MapReduce jobs and executes them on the cluster.
• It can be executed against a semi-distributed or fully distributed Hadoop installation.
• Here, the input and output data are present on HDFS.
 The command for MapReduce mode (the default, so the -x flag can be omitted):
$ pig -x mapreduce
WAYS TO EXECUTE PIG PROGRAM
These are the ways of executing a Pig program in local and MapReduce mode:
• Interactive Mode - In this mode, Pig is executed in the Grunt shell. To invoke the Grunt shell, run the pig command. Once the Grunt shell starts, we can enter Pig Latin statements and commands interactively at the command line.
• Batch Mode - In this mode, we can run a script file having a .pig extension. These files contain Pig Latin commands.
• Embedded Mode - In this mode, we can define our own functions, called UDFs (User Defined Functions), using programming languages like Java and Python.
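For example, the first two modes can be invoked from the command line like this (myscript.pig is a hypothetical file name):

$ pig -x local                   # interactive: opens the Grunt shell in local mode
$ pig -x mapreduce myscript.pig  # batch: runs a Pig Latin script on the cluster

In embedded mode, a Java program drives Pig through the PigServer API instead.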
PIG ARCHITECTURE
The language used to analyse data in Hadoop using Pig is known as Pig Latin.
It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.
To perform a particular task, programmers need to write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded).
After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, which makes the programmer's job easy.
The architecture of Apache Pig is shown below.
APACHE PIG COMPONENTS
As shown in the figure, there are various components in the Apache Pig framework. Let us take a look at the major components.
Parser
Initially, Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks.
The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as nodes and the data flows as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
The MapReduce jobs are submitted to Hadoop in sorted order. Finally, these MapReduce jobs are executed on Hadoop, producing the desired results.
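These plans can be inspected directly: the EXPLAIN operator prints the logical, physical, and MapReduce plans that the parser, optimizer, and compiler produce for a given relation (the alias below is hypothetical):

grunt> EXPLAIN processed;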
PIG LATIN DATA MODEL
The data model of Pig Latin is fully nested, and it allows complex non-atomic datatypes such as map and tuple. Given below is the diagrammatic representation of Pig Latin's data model.
An atomic value is one that is indivisible within the context of a database field definition (e.g. an integer, a real, a code of some sort, etc.).
Field values that are not atomic are of two undesirable types (Elmasri & Navathe 1989, pp. 139, 141): composite and multivalued.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples.
In other words, a collection of tuples (non-unique) is known as a
bag.
Each tuple can have any number of fields (flexible schema).
A bag is represented by ‘{}’.
 It is similar to a table in RDBMS, but unlike a table in RDBMS, it
is not necessary that every tuple contain the same number of
fields or that the fields in the same position (column) have the
same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as an inner bag.
Example − {Raja, 30, {9848022338, raja@gmail.com}}
Map
A map (or data map) is a set of key-value pairs.
The key needs to be of type chararray and should be unique. The value can be of any type. A map is represented by ‘[]’.
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin
are unordered (there is no guarantee that tuples are
processed in any particular order).
Grunt is the interactive shell of Apache Pig, and it is mainly used to write Pig Latin scripts.
Pig scripts can be executed from the Grunt shell, which is the native shell provided by Apache Pig to execute Pig queries.
We can also invoke shell and filesystem commands from Grunt using sh and fs.
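For example (the paths are hypothetical):

grunt> sh ls /tmp
grunt> fs -ls /user/data

Here sh runs a local shell command and fs runs a Hadoop filesystem command, without leaving the Grunt shell.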
JOB EXECUTION FLOW
The developer creates a script, which is saved to the local file system.
When the developer submits the Pig script, it is handed to the Pig Latin compiler.
The compiler then splits the task and runs a series of MR jobs.
Meanwhile, the Pig compiler retrieves data from HDFS.
After the MR jobs have run, the output file goes back to HDFS.
a. Pig Execution Modes
We can run Pig in two execution modes. These modes depend upon where the Pig script is going to run and where the data resides. Data can be stored on a single machine or in a distributed environment like a cluster.
There are three different ways to run Pig programs:
Non-interactive shell or script mode - the user creates a file containing the code and executes it as a script.
 The Grunt shell, or interactive shell, for running Apache Pig commands interactively.
 Embedded mode, in which Pig programs are run from Java code, much as JDBC is used to run SQL programs from Java.
b. Pig Local Mode
In this mode, Pig runs in a single JVM and accesses the local file system.
This mode is better for dealing with small data sets.
Parallel mapper execution is not possible, because the older versions of Hadoop are not thread-safe.
The user can provide -x local to get into Pig local mode of execution.
In this mode, Pig always looks for a local file system path while loading data.
c. Pig MapReduce Mode
In this mode, the user needs a proper Hadoop cluster setup and installation. By default, Apache Pig runs in MR mode. Pig translates the queries into MapReduce jobs and runs them on top of the Hadoop cluster. Hence, this mode runs as MapReduce on a distributed cluster.
Statements like LOAD and STORE read data from the HDFS file system and write results back to it; they are also used to process the data.
d. Storing Results
Intermediate data is generated during the processing of MR jobs.
Pig stores this data in a non-permanent location in HDFS; a temporary location is created inside HDFS for this intermediate data.
We can use DUMP to print the final results to the output screen, and the output results are stored using the STORE operator.
Type Description
int 4-byte integer
long 8-byte integer
float 4-byte (single precision) floating point
double 8-byte (double precision) floating point
bytearray Array of bytes; blob
chararray String (“hello world”)
boolean True/False (case insensitive)
datetime A date and time
biginteger Java BigInteger
bigdecimal Java BigDecimal
Type Description
Tuple Ordered set of fields (a “row / record”)
Bag Collection of tuples (a “resultset / table”)
Map A set of key-value pairs
Keys must be of type chararray
BinStorage
Loads and stores data in machine-readable (binary) format
PigStorage
Loads and stores data as structured, field delimited text files
TextLoader
Loads unstructured data in UTF-8 format
PigDump
Stores data in UTF-8 format
YourOwnFormat!
via UDFs
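As a sketch, two of these functions loading the same hypothetical file:

csv = LOAD 'input.csv' USING PigStorage(',') AS (id:int, name:chararray);
raw = LOAD 'input.csv' USING TextLoader() AS (line:chararray);

PigStorage parses each line into delimited fields, while TextLoader yields each whole line as a single chararray.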
Loads data from an HDFS file
var = LOAD 'employees.txt';
var = LOAD 'employees.txt' AS (id, name, salary);
var = LOAD 'employees.txt' USING PigStorage()
AS (id, name, salary);
Each LOAD statement defines a new bag
Each bag can have multiple elements (atoms)
Each element can be referenced by name or position ($n)
A bag is immutable
A bag can be aliased and referenced later
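Fields left untyped in an AS clause default to bytearray; types can also be declared explicitly, and elements then projected by name or by position, as this sketch shows (the types are illustrative):

emps   = LOAD 'employees.txt' USING PigStorage() AS (id:int, name:chararray, salary:float);
names  = FOREACH emps GENERATE name;  -- reference by name
names2 = FOREACH emps GENERATE $1;    -- reference by position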
STORE
Writes output to an HDFS file in a specified directory
grunt> STORE processed INTO 'processed_txt';
 Fails if directory exists
 Writes output files, part-[m|r]-xxxxx, to the directory
PigStorage can be used to specify a field delimiter
DUMP
Write output to screen
grunt> DUMP processed;
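For instance, to write a relation as comma-delimited text instead of the default tab-delimited output (processed is the relation from the STORE example above):

grunt> STORE processed INTO 'processed_csv' USING PigStorage(',');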
FOREACH
Applies expressions to every record in a bag
FILTER
Filters by expression
GROUP
Collect records with the same key
ORDER BY
Sorting
DISTINCT
Removes duplicates
Use the FILTER operator to restrict tuples or rows of data
Basic syntax:
alias2 = FILTER alias1 BY expression;
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = FILTER alias1 BY (col1 == 8) OR (NOT
(col2+col3 > col1));
DUMP alias2;
(4,2,1) (8,3,4) (7,2,5) (8,4,3)
 Use the GROUP…ALL operator to group data
 Use GROUP when only one relation is involved
 Use COGROUP when multiple relations are involved
 Basic syntax:
alias2 = GROUP alias1 ALL;
 Example:
DUMP alias1;
(John,18,4.0F) (Mary,19,3.8F) (Bill,20,3.9F)
(Joe,18,3.8F)
alias2 = GROUP alias1 BY col2;
DUMP alias2;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
Use the ORDER…BY operator to sort a relation based on one
or more fields
Basic syntax:
alias = ORDER alias BY field_alias [ASC|DESC];
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = ORDER alias1 BY col3 DESC;
DUMP alias2;
(7,2,5) (8,3,4) (1,2,3) (4,3,3) (8,4,3) (4,2,1)
Use the DISTINCT operator to remove duplicate tuples in
a relation.
Basic syntax:
alias2 = DISTINCT alias1;
Example:
DUMP alias1;
(8,3,4) (1,2,3) (4,3,3) (4,3,3) (1,2,3)
alias2= DISTINCT alias1;
DUMP alias2;
(8,3,4) (1,2,3) (4,3,3)
FLATTEN
Used to un-nest tuples as well as bags
INNER JOIN
Used to perform an inner join of two or more relations based on
common field values
OUTER JOIN
Used to perform left, right or full outer joins
SPLIT
Used to partition the contents of a relation into two or more
relations
SAMPLE
Used to select a random data sample with the stated sample
size
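Minimal sketches of three of these operators, assuming a relation data with a numeric field col1 and a relation grouped produced by a GROUP statement:

SPLIT data INTO small IF col1 < 5, big IF col1 >= 5;  -- partition into two relations
some = SAMPLE data 0.1;                               -- keep roughly 10% of the tuples
flat = FOREACH grouped GENERATE group, FLATTEN($1);   -- un-nest the grouped bag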
Use the JOIN operator to perform an inner, equi-join of two or more relations based on common field values
The JOIN operator always performs an inner join
Inner joins ignore null keys
Filter null keys before the join
JOIN and COGROUP operators perform similar
functions
 JOIN creates a flat set of output records
COGROUP creates a nested set of output records
DUMP Alias1;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
DUMP Alias2;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
Join Alias1 by Col1 to Alias2 by Col1:
Alias3 = JOIN Alias1 BY Col1, Alias2 BY Col1;
DUMP Alias3;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
Use the OUTER JOIN operator to perform left, right, or full
outer joins
 Pig Latin syntax closely adheres to the SQL standard
The keyword OUTER is optional
 keywords LEFT, RIGHT and FULL will imply left outer, right outer
and full outer joins respectively
Outer joins will only work provided the relations which need
to produce nulls (in the case of non-matching keys) have
schemas
Outer joins will only work for two-way joins
 To perform a multi-way outer join perform multiple two-way outer
join statements
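A minimal sketch, assuming relations A and B each carry a schema with an id field:

C = JOIN A BY id LEFT OUTER, B BY id;  -- A rows with no match get nulls for B's fields
D = JOIN A BY id FULL, B BY id;        -- the OUTER keyword is optional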
Natively written in Java, packaged as a jar file
Other languages include Jython, JavaScript, Ruby,
Groovy, and Python
Register the jar with the REGISTER statement
Optionally, alias it with the DEFINE statement
REGISTER /src/myfunc.jar;
A = LOAD 'students';
B = FOREACH A GENERATE myfunc.MyEvalFunc($0);
DEFINE can be used to work with UDFs and also
streaming commands
Useful when dealing with complex input/output formats
/* read and write comma-delimited data */
DEFINE Y 'stream.pl' INPUT(stdin USING
PigStreaming(','))
OUTPUT(stdout USING PigStreaming(','));
A = STREAM X THROUGH Y;
/* Define UDFs to a more readable format */
DEFINE MAXNUM
org.apache.pig.piggybank.evaluation.math.MAX;
A = LOAD 'student_data' AS (name:chararray, gpa1:float,
gpa2:double);
B = FOREACH A GENERATE name, MAXNUM(gpa1, gpa2);
DUMP B;
THANK YOU…