Faculty Name: Namrata Sharma/Arjun S. Parihar
Year/Branch: 3rd/CSE
Subject Code: CS-503(A)
Subject Name: Data Analytics
In this session you will learn about:
Apache Pig vs MapReduce
Introduction to Pig
Twitter Case Study
Pig Architecture
Installing and Running Pig
User-Defined Functions
Learning Objectives
As we mentioned in our Hadoop Ecosystem blog, Apache Pig is an essential part of our Hadoop
ecosystem.
Before starting with Apache Pig, consider the question:
“If MapReduce was already there for Big Data analytics, why did Apache Pig come into the picture?”
The short answer: approximately 10 lines of Pig code are equal to 200 lines of MapReduce code.
Programmers face difficulty writing MapReduce tasks because it requires Java or Python
programming knowledge. For them, Apache Pig is a savior.
Pig Latin is a high-level data flow language, whereas MapReduce is a low-level data
processing paradigm.
Without writing complex Java implementations in MapReduce, programmers can achieve
the same implementations very easily using Pig Latin.
Apache Pig uses a multi-query approach (i.e. using a single Pig Latin query we can
accomplish multiple MapReduce tasks), which reduces the length of the code by 20 times.
Hence, this reduces the development period by almost 16 times.
Apache Pig vs MapReduce
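To make the brevity claim concrete, here is the classic word count written in Pig Latin. This is a minimal sketch; the input and output paths are assumptions:

lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'wordcount_out';

As noted above, the MapReduce equivalent (mapper class, reducer class, and driver) runs to roughly 200 lines of Java.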
Pig provides many built-in operators to support data operations like joins, filters, ordering,
and sorting, whereas performing the same functions in MapReduce is a humongous task.
Performing a Join operation in Apache Pig is simple, whereas it is difficult in MapReduce
to perform a Join between data sets, as it requires multiple MapReduce tasks
to be executed sequentially to fulfill the job.
In addition, Pig provides nested data types like tuples, bags, and maps that are missing
from MapReduce. For example, a single record such as (1, {(2),(3)}, [city#Indore]) nests a bag and a map inside one tuple.
Apache Pig is a platform used to analyze large data sets by representing them as data flows. It
is designed to provide an abstraction over MapReduce, reducing the complexity of writing
a MapReduce program. We can perform data manipulation operations very easily in Hadoop
using Apache Pig.
The features of Apache Pig are:
Pig enables programmers to write complex data transformations without knowing Java.
Apache Pig has two main components – the Pig Latin language and the Pig Run-time
Environment, in which Pig Latin programs are executed.
For Big Data Analytics, Pig gives a simple data flow language known as Pig Latin, which
has SQL-like functionality such as join, filter, and limit (a short sketch follows below).
Introduction to Apache Pig
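A quick sketch of this SQL-like flavor; the relation names, file paths, and schemas below are assumptions:

users = LOAD 'users.csv' USING PigStorage(',') AS (id:int, name:chararray, age:int);
adults = FILTER users BY age >= 18; -- like a SQL WHERE clause
orders = LOAD 'orders.csv' USING PigStorage(',') AS (uid:int, amount:double);
joined = JOIN adults BY id, orders BY uid; -- like a SQL inner join
sorted = ORDER joined BY amount DESC; -- like ORDER BY
top10 = LIMIT sorted 10; -- like LIMIT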
Developers who work with scripting languages and SQL can leverage Pig Latin, which
gives them ease of programming with Apache Pig. Pig Latin provides various built-in
operators like join, sort, and filter to read, write, and process large data sets. Thus it is
evident that Pig has a rich set of operators.
Programmers write scripts using Pig Latin to analyze data, and these scripts are internally
converted to Map and Reduce tasks by the Pig MapReduce engine. Before Pig, writing
MapReduce tasks was the only way to process the data stored in HDFS.
If programmers want custom functions that are unavailable in Pig, Pig allows
them to write User Defined Functions (UDFs) in a language of their choice, such as Java,
Python, Ruby, Jython, or JRuby, and embed them in a Pig script (a sketch follows below). This
provides extensibility to Apache Pig.
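As a minimal sketch of how a UDF is wired in from the Pig side, suppose we have compiled a hypothetical Java class com.example.MyUpper (which uppercases a chararray) into myudfs.jar; both names are assumptions:

REGISTER myudfs.jar; -- make the jar's classes visible to Pig
DEFINE MYUPPER com.example.MyUpper(); -- bind a short alias to the UDF class
users = LOAD 'users.txt' AS (name:chararray);
upper_names = FOREACH users GENERATE MYUPPER(name); -- invoke the UDF like a built-in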
Pig can process any kind of data, i.e. structured, semi-structured or unstructured data,
coming from various sources. Apache Pig handles all kinds of data.
Approximately 10 lines of Pig code are equal to 200 lines of MapReduce code.
It can handle inconsistent schemas (in the case of unstructured data).
Apache Pig extracts the data, performs operations on that data and dumps the data in the
required format in HDFS i.e. ETL (Extract Transform Load).
Apache Pig automatically optimizes the tasks before execution, i.e. automatic
optimization.
It allows programmers and developers to concentrate on the whole operation instead
of creating mapper and reducer functions separately.
Apache Pig is used for analyzing and performing tasks involving ad-hoc processing.
Apache Pig is used:
Where we need to process huge data sets, like web logs, streaming online data, etc.
Where we need data processing for search platforms (where different types of data need to be
processed). For example, Yahoo uses Pig for 40% of its jobs, including news feeds and its search
engine.
Where we need to process time-sensitive data loads, i.e. where data needs to be extracted and
analyzed quickly. E.g. machine learning algorithms require time-sensitive data loads:
Twitter needs to quickly extract data on customer activities (i.e. tweets, re-tweets, and likes),
analyze the data to find patterns in customer behavior, and make recommendations
immediately, such as trending tweets.
Where to use Apache Pig?
Consider the case study of Twitter, which adopted Apache Pig.
Twitter’s data was growing at an accelerating rate (roughly 10 TB per day). Thus, Twitter decided
to move the archived data to HDFS and adopt Hadoop to extract business value out
of it. Their major aim was to analyze the data stored in Hadoop to come up with the following
insights on a daily, weekly, or monthly basis.
Counting operations:
How many requests does Twitter serve in a day?
What is the average latency of the requests?
How many searches happen each day on Twitter?
How many unique queries are received?
How many unique users come to visit?
What is the geographic distribution of the users?
Apache Pig Tutorial: Twitter Case Study
Correlating Big Data:
How does usage differ for mobile users?
Cohort analysis: analyzing data by categorizing users based on their behavior.
What goes wrong when a site problem occurs?
Which features do users use most often?
Search correction and search suggestions.
Research on Big Data to produce better outcomes, such as:
What can Twitter analyze about users from their tweets?
Who follows whom, and on what basis?
What is the ratio of followers to following?
What is the reputation of a user?
So, Twitter moved to Apache Pig for analysis.
Now, joining data sets, grouping them, sorting them, and retrieving data become
easier and simpler.
You can see in the below image how Twitter used Apache Pig to analyze its large
data set.
Twitter had both semi-structured data like Twitter Apache logs, Twitter search logs, Twitter
MySQL query logs, application logs and structured data like tweets, users, block
notifications, phones, favorites, saved searches, re-tweets, authentications, SMS usage,
user followings, etc., all of which can be easily processed by Apache Pig.
Twitter dumps all its archived data on HDFS.
It has two tables, i.e. user data and tweet data.
User data contains information about the users, like username, followers, followings, number
of tweets, etc.
Tweet data contains the tweet, its owner, number of re-tweets, number of likes, etc.
Twitter uses this data to analyze customer behavior and improve the user
experience.
Problem: analyze how many tweets are stored per user in the given tweet tables.
STEP 1– First of all, Twitter imports the Twitter tables (i.e. the user table and the tweet table) into HDFS.
STEP 2– Then Apache Pig loads (LOAD) the tables into the Apache Pig framework.
STEP 3– Then it joins and groups the tweet table and the user table using the COGROUP command, as shown
in the above image.
This results in the inner bag data type, which we will discuss later in this session.
Example of Inner bags produced (refer to the above image) –
(1,{(1,Jay,xyz),(1,Jay,pqr),(1,Jay,lmn)})
(2,{(2,Ellie,abc),(2,Ellie,vxy)})
(3, {(3,Sam,stu)})
STEP 4– Then the tweets are counted per user using the COUNT command, so that the total
number of tweets per user can be easily calculated.
Example of tuple produced as (id, tweet count) (refer to the above image) –
(1, 3)
(2, 2)
(3, 1)
STEP 5– At last, the result is joined with the user table to attach the user name to the produced result.
Example of tuple produced as (id, name, tweet count) (refer to the above image) –
(1, Jay, 3)
(2, Ellie, 2)
(3, Sam, 1)
STEP 6– Finally, this result is stored back in the HDFS.
The step-by-step solution of this problem is shown in the above image.
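The same pipeline can be sketched in Pig Latin as below; the paths, delimiters, and field names are assumptions chosen to match the example tuples above:

users = LOAD '/twitter/users' USING PigStorage(',') AS (uid:int, name:chararray); -- STEP 2
tweets = LOAD '/twitter/tweets' USING PigStorage(',') AS (uid:int, tweet:chararray); -- STEP 2
cogrouped = COGROUP tweets BY uid, users BY uid; -- STEP 3: yields inner bags per user id
counts = FOREACH cogrouped GENERATE group AS uid, COUNT(tweets) AS tweet_count; -- STEP 4
named = JOIN counts BY uid, users BY uid; -- STEP 5: attach the user name
result = FOREACH named GENERATE counts::uid, users::name, counts::tweet_count;
STORE result INTO '/twitter/tweet_counts'; -- STEP 6: back to HDFS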
For writing Pig scripts, we need the Pig Latin language, and to execute them, we need an execution
environment. The architecture of Apache Pig is outlined below.
Architecture
Grunt Shell: Pig's interactive shell, provided to execute all Pig scripts.
Pig Server: Pig commands can also be written in a script file; the script file is then executed by the Pig Server.
Parser: Pig scripts are passed to the Parser, which does type checking and checks the syntax of the script. The output of the Parser is a DAG (directed acyclic graph) of the script's statements.
Optimizer: The DAG is submitted to the Optimizer, which performs optimization activities like split, merge, transform, and reorder of operators. This Optimizer provides the automatic optimization feature of Apache Pig.
Compiler: The compiler is responsible for converting the optimized Pig jobs automatically into MapReduce jobs.
Apache Pig Installation on Linux:
Below are the steps for Apache Pig installation on Linux (Ubuntu/CentOS, or Windows using a
Linux VM). I am using Ubuntu 16.04 in the setup below.
Step 1: Download Pig tar file.
Command: wget http://www-us.apache.org/dist/pig/pig-0.16.0/pig-0.16.0.tar.gz
Step 2: Extract the tar file using the tar command.
In the tar command below, x means extract the archive file, z means filter the archive through gzip, and f means
the archive filename follows.
Command: tar -xzf pig-0.16.0.tar.gz
Command: ls
Step 3: Edit the “.bashrc” file to update the environment variables for Apache Pig.
We set these so that we can access Pig from any directory; we need not go to the Pig directory to
execute Pig commands. Also, if any other application is looking for Pig, it will get the path of
Apache Pig from this file.
Command: sudo gedit .bashrc
Add the following at the end of the file:
# Set PIG_HOME
export PIG_HOME=/home/edureka/pig-0.16.0
export PATH=$PATH:/home/edureka/pig-0.16.0/bin
export PIG_CLASSPATH=$HADOOP_CONF_DIR
Also, make sure that the Hadoop path is set.
Run the below command to apply the changes in the same terminal.
Command: source .bashrc
Step 4: Check pig version.
Command: pig -version
Step 5: Check pig help to see all the pig command options.
Command: pig -help
Step 6: Run Pig to start the grunt shell. Grunt shell is used to run Pig Latin scripts.
Command: pig
As shown above, Apache Pig has two modes in which it can run; by
default it chooses MapReduce mode. The other mode in which you can run Pig is local
mode.
MapReduce Mode – This is the default mode, which requires access to a Hadoop cluster and an HDFS
installation. Since this is the default mode, it is not necessary to specify the -x flag (you can
execute pig or pig -x mapreduce). The input and output in this mode are on HDFS.
Local Mode – With access to a single machine, all files are installed and run using the local host and
local file system. Local mode is specified using the '-x' flag (pig -x local). The input and output in this
mode are on the local file system.
Command: pig -x local
Execution modes in Apache Pig
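For example, a saved script (hypothetical file name wordcount.pig) can be run in either mode:
Command: pig -x local wordcount.pig (local mode; reads and writes the local file system)
Command: pig wordcount.pig (MapReduce mode; reads and writes HDFS)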
Pig Latin is the language used to analyze data in Hadoop using Apache Pig. In this chapter, we
discuss the basics of Pig Latin, such as Pig Latin statements, data types, general and relational
operators, and Pig Latin UDFs.
Pig Latin – Data Model
As discussed earlier, the data model of Pig is fully nested. A relation is the
outermost structure of the Pig Latin data model, and it is a bag, where −
A bag is a collection of tuples.
A tuple is an ordered set of fields.
A field is a piece of data.
Pig Latin
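For example, loading a small student file produces a structure that shows all three levels; the file name and schema are assumptions:

students = LOAD 'students.txt' USING PigStorage(',') AS (id:int, name:chararray, gpa:float);
-- 'students' is a relation (a bag); each row such as (1,Jay,8.5) is a tuple; id, name, and gpa are its fields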
Pig Latin – Statements
While processing data using Pig Latin, statements are the basic constructs.
These statements work with relations. They include expressions and schemas.
Every statement ends with a semicolon (;).
We will perform various operations using operators provided by Pig Latin, through statements.
Except for LOAD and STORE, all Pig Latin statements take a relation
as input and produce another relation as output.
As soon as you enter a LOAD statement in the Grunt shell, its semantic checking is carried out. To
see the contents of the relation, you need to use the DUMP operator. Only after performing
the DUMP operation is the MapReduce job that actually loads the data carried out.
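A minimal Grunt session illustrating this lazy execution; the file name is an assumption:

grunt> student = LOAD 'student.txt' USING PigStorage(',') AS (id:int, name:chararray); -- parsed and semantically checked only; nothing runs yet
grunt> DUMP student; -- only now is a MapReduce job launched and the data actually read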
1. int: Represents a signed 32-bit integer. Example: 8
2. long: Represents a signed 64-bit integer. Example: 5L
3. float: Represents a signed 32-bit floating point. Example: 5.5F
4. double: Represents a 64-bit floating point. Example: 10.5
5. chararray: Represents a character array (string) in Unicode UTF-8 format. Example: 'tutorials point'
6. bytearray: Represents a byte array (blob).
7. boolean: Represents a Boolean value. Example: true/false
8. datetime: Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
9. biginteger: Represents a Java BigInteger. Example: 60708090709
Pig Latin – Data types
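Most of these types appear when declaring a schema in a LOAD statement, for example (file name and fields are assumptions):

emp = LOAD 'emp.txt' USING PigStorage(',') AS (id:int, views:long, rating:float, salary:double, name:chararray, active:boolean);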
(): Tuple constructor operator. This operator is used to construct a tuple. Example: (Raju, 30)
{}: Bag constructor operator. This operator is used to construct a bag. Example: {(Raju, 30), (Mohammad, 45)}
[]: Map constructor operator. This operator is used to construct a map. Example: [name#Raja, age#30]
Pig Latin – Type Construction Operators
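These constructors can also appear as constants inside a FOREACH ... GENERATE; a sketch, assuming a relation A exists (exact constant syntax can vary slightly by Pig version):

B = FOREACH A GENERATE ('Raju', 30) AS person, {('Raju', 30), ('Mohammad', 45)} AS people, ['name'#'Raja', 'age'#30] AS info;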
Pig Latin – Relational Operations
Operator Description
Loading and Storing
LOAD To load data from the file system (local/HDFS) into a relation.
STORE To save a relation to the file system (local/HDFS).
Filtering
FILTER To remove unwanted rows from a relation.
DISTINCT To remove duplicate rows from a relation.
FOREACH ... GENERATE To generate data transformations based on columns of data.
STREAM To transform a relation using an external program.
Grouping and Joining
JOIN To join two or more relations.
COGROUP To group the data in two or more relations.
GROUP To group the data in a single relation.
CROSS To create the cross product of two or more relations.
Sorting
ORDER To arrange a relation in a sorted order based on one or
more fields (ascending or descending).
LIMIT To get a limited number of tuples from a relation.
Combining and Splitting
UNION To combine two or more relations into a single relation.
SPLIT To split a single relation into two or more relations.
Diagnostic Operators
DUMP To print the contents of a relation on the console.
DESCRIBE To describe the schema of a relation.
EXPLAIN To view the logical, physical, or MapReduce execution
plans to compute a relation.
ILLUSTRATE To view the step-by-step execution of a series of
statements.
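A short script chaining several of these relational operators end to end; the path, delimiter, and schema are assumptions:

logs = LOAD '/data/logs' USING PigStorage('\t') AS (user:chararray, url:chararray, secs:int);
long_hits = FILTER logs BY secs > 30; -- Filtering
by_user = GROUP long_hits BY user; -- Grouping
totals = FOREACH by_user GENERATE group AS user, COUNT(long_hits) AS hits; -- per-group transformation
ranked = ORDER totals BY hits DESC; -- Sorting
top5 = LIMIT ranked 5; -- keep only the first five tuples
DUMP top5; -- Diagnostic: print to the console
STORE ranked INTO '/data/top_users'; -- Storing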
Thank you