Hadoop Developer Training
Session 04 - PIG
Agenda
• PIG
• PIG - Overview
• Installation and Running Pig
• Load in Pig
• Macros in Pig
What is Apache Pig?
• Apache Pig is an abstraction over MapReduce. It is a tool/platform used to
analyze large data sets by representing them as data flows. Pig is generally
used with Hadoop; we can perform all the data manipulation operations in
Hadoop using Apache Pig.
• To write data analysis programs, Pig provides a high-level language
known as Pig Latin. This language provides various operators using which
programmers can develop their own functions for reading, writing, and
processing data.
• To analyze data using Apache Pig, programmers need to write scripts
using Pig Latin language. All these scripts are internally converted to Map
and Reduce tasks. Apache Pig has a component known as Pig Engine that
accepts the Pig Latin scripts as input and converts those scripts into
MapReduce jobs.
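As a rough illustration of this data-flow style (the input path and field names below are assumptions for the example, not taken from this deck), a classic word count in Pig Latin is simply a chain of such operators, which the Pig Engine turns into MapReduce jobs:
-- illustrative sketch: count word occurrences in a text file
lines = LOAD '/pig_data/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
DUMP counts;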
Installation and Running Pig
• PIG is a data analytics framework in the Hadoop ecosystem. It can be installed
and run in the following modes:
• Local Mode: Pig runs on a single machine and uses the local file system for
storage. Local mode is used mainly for debugging and testing Pig Latin
scripts. Specify ‘local’ as an argument to pig to start it in local mode.
• MapReduce Mode: This is the default mode of Pig. It requires a configured
Hadoop cluster so that files can be loaded from HDFS and Pig Latin scripts
run as MapReduce jobs.
Follow the steps below to configure Pig in your environment:
Download the latest version of Pig from the Apache Pig downloads page.
Untar the Pig tarball using the command
tar -xzvf pig-0.13.1.tar.gz
Create a folder in /opt for pig installation
mkdir /opt/pig
Move the extracted directory to its folder in /opt
mv /opt/setups/pig-0.13.1 /opt/pig
Open .bash_profile
nano ~/.bash_profile
• Add Pig to the PATH
export PIG_HOME=/opt/pig/pig-0.13.1
export PATH=$PATH:$PIG_HOME/bin
Test for successful pig installation
pig -h
Start pig in local mode
pig -x local
To read data from HDFS, start Pig in MapReduce mode
pig -x mapreduce
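As a quick sanity check of the installation (the sample file name here is just an illustration, not part of these slides), a short local-mode session could look like:
pig -x local
grunt> test = LOAD '/tmp/sample.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> DUMP test;
grunt> quit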
Installation and Running Pig
In general, Apache Pig works on top of Hadoop. It is an analytical tool that
analyzes large datasets stored in the Hadoop file system (HDFS). To analyze data
using Apache Pig, we first have to load the data into Apache Pig. This
section explains how to load data into Apache Pig from HDFS.
Preparing HDFS
In MapReduce mode, Pig reads (loads) data from HDFS and stores the
results back in HDFS. Therefore, let us start HDFS and create the following
sample data in HDFS.
The dataset below contains personal details (id, first name, last name,
phone number, and city) of five students.
Student ID   First Name   Last Name    Phone        City
1            Peter        Burke        4353521729   Salt Lake City
2            Aaron        Kimberlake   8013528191   Salt Lake City
3            Danny        Jacob        2958295582   Salt Lake City
4            Angela       Kouth        2938811911   Salt Lake City
5            Peggy        Karter       3202289119   Salt Lake City
Installation and Running Pig
Create a Directory in HDFS
In the Hadoop file system (HDFS), you can create directories using the mkdir command.
The input file for Pig contains one tuple/record per line, and the fields of
each record are separated by a delimiter (a comma in our example).
In the local file system, create an input file student_data.txt containing data
as shown below.
1, Peter, Burke, 4353521729, Salt Lake City
2, Aaron, Kimberlake, 8013528191, Salt Lake City
3, Danny, Jacob, 2958295582, Salt Lake City
4, Angela, Kouth, 2938811911, Salt Lake City
5, Peggy, Karter, 3202289119, Salt Lake City
Now, move the file from the local file system to HDFS using the put command.
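For example, assuming the HDFS directory /pig_data/ used by the later LOAD statements, the commands would be roughly:
hdfs dfs -mkdir /pig_data
hdfs dfs -put student_data.txt /pig_data/
hdfs dfs -cat /pig_data/student_data.txt    # verify the file landed in HDFS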
Load in Pig
The load statement will simply load the data into the specified relation in
Pig.
The load statement consists of two parts divided by the “=” operator. On
the left-hand side, we mention the name of the relation where we want to
store the data; on the right-hand side, we define how the data is to be
loaded (the input path, the load function, and the schema). Given below is
the syntax of the Load operator.
Relation_name = LOAD 'Input file path' USING function as schema;
Schema − We have to define the schema of the data. We can define the
required schema as follows −
(column1 : data type, column2 : data type, column3 : data type);
Now load the data from the file student_data.txt into Pig by executing the
following Pig Latin statement in the Grunt shell.
student = LOAD '/pig_data/student_data.txt' USING PigStorage(',') as (
id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
Load in Pig
We have used the PigStorage() function. It loads and stores data as
structured text files. It takes as a parameter the delimiter by which the
fields of a tuple are separated; by default, the delimiter is the tab
character ('\t').
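So, for a tab-delimited file the delimiter argument can simply be omitted; for instance (the file name below is hypothetical):
tab_data = LOAD '/pig_data/tab_data.txt' USING PigStorage() AS (id:int, name:chararray);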
Having loaded data into Apache Pig, you can store the loaded data in the
file system using the STORE operator.
In MapReduce mode, the stored output is written to HDFS.
STORE Relation_name INTO 'required_directory_path' [USING function];
student = LOAD '/pig_data/student_data.txt' USING PigStorage(',') as (
id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
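After the STORE statement runs, the output can be inspected directly in HDFS. The exact part-file names depend on the job, so the wildcard below is an assumption:
hdfs dfs -ls hdfs://localhost:9000/pig_Output/
hdfs dfs -cat hdfs://localhost:9000/pig_Output/part-*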
Load in Pig
Dump Operator
The Dump operator is used to run the Pig Latin statements and display the
results on the screen.
Given below is the syntax of the Dump operator.
grunt> Dump Relation_Name;
grunt> Dump student;
Describe Operator
The describe operator is used to view the schema of a relation.
grunt> Describe Relation_name
grunt> describe student;
Load in Pig
Output
Once you execute the above describe statement, it produces the following
output.
grunt> student: { id: int,firstname: chararray,lastname: chararray,phone:
chararray,city: chararray }
The illustrate operator gives you the step-by-step execution of a sequence
of statements.
Syntax
Given below is the syntax of the illustrate operator.
grunt> illustrate Relation_name;
Now, let us illustrate the relation named student as shown below.
grunt> illustrate student;
Load in Pig
Assume that we have a file named student_details.txt in the HDFS directory
/pig_data/ as shown below.
student_details.txt
1, Peter, Burke, 21, 4353521729, Salt Lake City
2, Aaron, Kimberlake, 22, 8013528191, Salt Lake City
3, Danny, Jacob, 22, 2958295582, Salt Lake City
4, Angela, Kouth, 21, 2938811911, Salt Lake City
5, Peggy, Karter, 23, 3202289119, Salt Lake City
6, King, Salmon, 23, 2398329282, Salt Lake City
7, Carolyn, Fisher, 24, 2293322829, Salt Lake City
8, John, Hopkins, 24, 2102392020, Salt Lake City
grunt> student_details = LOAD '/pig_data/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
Load in Pig
The GROUP operator is used to group the data in one or more relations. It
collects the data having the same key.
Syntax
Given below is the syntax of the group operator.
grunt> Group_data = GROUP Relation_name BY column_name;
grunt> group_data = GROUP student_details by age;
grunt> dump group_data;
Load in Pig
Output
Then you will get output displaying the contents of the relation named
group_data as shown below. Here you can observe that the resulting
schema has two columns −
One is age, by which we have grouped the relation.
The other is a bag, which contains the group of tuples, student records with
the respective age.
(21,{(4, Angela, Kouth, 21, 2938811911, Salt Lake City), (1, Peter, Burke, 21,
4353521729, Salt Lake City)})
(22,{(3, Danny, Jacob, 22, 2958295582, Salt Lake City), (2, Aaron, Kimberlake, 22,
8013528191, Salt Lake City)})
(23,{(6, King, Salmon, 23, 2398329282, Salt Lake City), (5, Peggy, Karter, 23,
3202289119, Salt Lake City)})
(24,{(8, John, Hopkins, 24, 2102392020, Salt Lake City), (7, Carolyn, Fisher, 24,
2293322829, Salt Lake City)})
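A grouped relation is usually consumed by a FOREACH statement. As a sketch of what comes next (not shown in these slides), counting the students in each age group could be written as:
grunt> age_counts = FOREACH group_data GENERATE group AS age, COUNT(student_details) AS num_students;
grunt> dump age_counts;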
Load in Pig
Two more sample files will be used in the upcoming examples:
customers.txt
1, Peter, 32, Salt Lake City, 2000.00
2, Aaron, 25, Salt Lake City, 1500.00
3, Danny, 23, Salt Lake City, 2000.00
4, Angela, 25, Salt Lake City, 6500.00
5, Peggy, 27, Salt Lake City, 8500.00
6, King, 22, Salt Lake City, 4500.00
7, Carolyn, 24, Salt Lake City, 10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
Topics to be covered in next session
PIG
• Loads in Pig Continued
• Verification
• Filters
• Macros in Pig
Thank you!