Apache Pig
Sachin Vakkund
KLE Technological University
sachinvakkund6@gmail.com
linkedin.com/in/sachinvakkund
WHAT IS PIG?
• Apache Pig is a high-level platform for creating programs that run on Apache Hadoop.
• It is a tool/platform used to analyze large data sets by representing them as data flows.
• Pig generates and compiles MapReduce programs on the fly.
• To write data analysis programs, Pig provides a high-level language known as Pig Latin.
• Apache Pig has a component known as the Pig Engine that accepts Pig Latin scripts as input
and converts those scripts into MapReduce jobs.
WHY PIG?
• Using Pig Latin, programmers can perform MapReduce tasks easily without having
to write complex Java code.
• Instead of writing roughly 200 lines of code (LoC) in Java, we write roughly 10 LoC in Apache Pig for the same operation, as the word-count sketch below illustrates.
• Pig Latin is an SQL-like language, so Apache Pig is easy to learn if you are
familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like joins,
filters, ordering, etc.
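As a minimal illustration of that brevity claim (the file and directory names here are hypothetical), the classic word-count task takes about five lines of Pig Latin, versus a few hundred lines of raw Java MapReduce:

  lines   = LOAD 'input.txt' AS (line:chararray);                    -- one tuple per line of text
  words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;  -- split each line into words
  grouped = GROUP words BY word;                                     -- gather identical words together
  counts  = FOREACH grouped GENERATE group AS word, COUNT(words);    -- count each group
  STORE counts INTO 'wordcount_out';                                 -- write the results out

At execution time, the Pig Engine converts each of these statements into one or more MapReduce stages.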
FEATURES OF PIG
• Rich set of operators – join, sort, filter, etc. (see the short sketch after this list)
• Ease of programming – similar to SQL
• Optimization opportunities – tasks are optimized automatically
• Extensibility – users can develop their own functions to read, process, and write data.
• Handles all kinds of data – analyzes structured and unstructured data
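A short sketch of those operators in action (relation names, files, and fields are all hypothetical):

  users   = LOAD 'users.csv'  USING PigStorage(',') AS (id:int, name:chararray);
  orders  = LOAD 'orders.csv' USING PigStorage(',') AS (uid:int, amount:double);
  big     = FILTER orders BY amount > 100.0;     -- filter: keep only large orders
  joined  = JOIN users BY id, big BY uid;        -- join: match orders to users
  ordered = ORDER joined BY big::amount DESC;    -- sort: largest orders first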
Apache Pig vs MapReduce
• Apache Pig is a data flow language; MapReduce is a data processing paradigm.
• Pig Latin is a high-level language; MapReduce is low level and rigid.
• Performing a Join operation in Apache Pig is pretty simple; in MapReduce it is quite difficult to perform a Join between datasets.
• Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig; exposure to Java is a must to work with MapReduce.
• Apache Pig uses a multi-query approach, reducing code length to a great extent; MapReduce requires almost 20 times the number of lines to perform the same task.
• There is no need for compilation: on execution, every Apache Pig operator is converted internally into a MapReduce job; MapReduce jobs have a long compilation process.
Apache Pig vs SQL
• Pig Latin is a procedural language; SQL is a declarative language.
• In Apache Pig, the schema is optional: we can store data without designing a schema, and fields are then referenced by position as $0, $1, etc. (see the short sketch after this table); in SQL, a schema is mandatory.
• The data model in Apache Pig is nested relational; the data model used in SQL is flat relational.
• Apache Pig provides limited opportunity for query optimization; there is more opportunity for query optimization in SQL.
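To make the schema-optional point concrete (the file name is hypothetical): a relation loaded without an AS clause carries no declared schema, and its fields are addressed purely by position:

  raw   = LOAD 'data.csv' USING PigStorage(',');  -- no schema declared
  pairs = FOREACH raw GENERATE $0, $1;            -- first two fields, referenced by position
  DUMP pairs;                                     -- print the result to the console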
Apache Pig vs Hive
• Apache Pig uses a language called Pig Latin; it was originally created at Yahoo. Hive uses a language called HiveQL; it was originally created at Facebook.
• Pig Latin is a data flow language; HiveQL is a query processing language.
• Pig Latin is a procedural language and fits the pipeline paradigm; HiveQL is a declarative language.
• Apache Pig can handle structured, unstructured, and semi-structured data; Hive is mostly for structured data.
Applications of Apache Pig
Apache Pig is generally used by data scientists for tasks involving ad-hoc processing and quick prototyping. Apache Pig is used:
• To process huge data sources such as web logs.
• To perform data processing for search platforms.
• To process time-sensitive data loads.
CREATE YOUR FIRST PIG PROGRAM
• Problem Statement: find out the number of products sold in each country.
• Input: our input data set is a CSV file, SalesJan2009.csv.
PREREQUISITES:
• This tutorial was developed on the Linux (Ubuntu) operating system.
• You should have Hadoop (version 2.2.0 is used in this tutorial) already
installed and running on the system.
• You should have Java (version 1.8.0 is used in this tutorial) already
installed on the system.
• You should have set JAVA_HOME accordingly.
• This guide is divided into two parts: Pig Installation and Pig Demo.
PART 1) PIG INSTALLATION
• Change user to 'hduser' (the user used for Hadoop configuration).
• Step 1) Download the latest stable release of Pig (version 0.12.1 is used in
this tutorial) from one of the mirror sites listed at
  http://pig.apache.org/releases.html
  Select the tar.gz file (and not src.tar.gz) to download.
• Step 2) Once the download is complete, navigate to the directory
containing the downloaded tar file and move it to the location
where you want to set up Pig. In this case we move it to /usr/local.
  Move to the directory containing the Pig files:
  cd /usr/local
  Extract the contents of the tar file:
  sudo tar -xvf pig-0.12.1.tar.gz
• Step 3) Modify ~/.bashrc to add Pig-related environment variables.
  Open ~/.bashrc in any text editor of your choice and make the modifications below:
  export PIG_HOME=<Installation directory of Pig>
  export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH
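For example, assuming Pig was extracted to /usr/local/pig-0.12.1 as in Step 2 (adjust the path to your own installation):

  export PIG_HOME=/usr/local/pig-0.12.1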
• Step 4) Now source this environment configuration using the command below:
  . ~/.bashrc
• Step 5) We need to recompile Pig to support Hadoop 2.2.0. Here are the steps:
  Go to the Pig home directory:
  cd $PIG_HOME
  Install Ant:
  sudo apt-get install ant
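The slide stops after installing Ant and does not show the rebuild itself. A commonly documented command for rebuilding Pig 0.12.x against Hadoop 2.x is the one below; treat it as an assumption and verify against the build instructions shipped with your Pig release:

  sudo ant clean jar-all -Dhadoopversion=23   # assumed build target and flag for Hadoop 2.x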
• Step 6) Test the Pig installation using the command:
  pig -help
PART 2) PIG DEMO
• Step 7) Start Hadoop:
  $HADOOP_HOME/sbin/start-dfs.sh
  $HADOOP_HOME/sbin/start-yarn.sh
• Step 8) In MapReduce mode, Pig takes files from HDFS and stores the
results back to HDFS.
  Copy the file SalesJan2009.csv (stored on the local file
system at ~/input/SalesJan2009.csv) to the HDFS (Hadoop Distributed File
System) home directory.
  Here the file is in the folder input; if the file is stored in some other
location, give that path instead.
  $HADOOP_HOME/bin/hdfs dfs -copyFromLocal ~/input/SalesJan2009.csv /
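As an optional sanity check (not part of the original steps), confirm the file is now in the HDFS root directory:

  $HADOOP_HOME/bin/hdfs dfs -ls /   # SalesJan2009.csv should appear in the listing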
• Step 9) Pig configuration.
  First navigate to $PIG_HOME/conf:
  cd $PIG_HOME/conf
  sudo cp pig.properties pig.properties.original
  Open pig.properties using a text editor of your choice, and specify the log
file path using the pig.logfile property:
  sudo gedit pig.properties
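For example, a pig.logfile entry might look like the line below (the path is illustrative; any writable location works):

  pig.logfile=/usr/local/pig-0.12.1/pig.log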
• Step 10) Run the command 'pig', which starts the Pig command prompt, an interactive
shell for Pig queries.
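By default this starts Pig in MapReduce mode, which runs against HDFS. For quick experiments against the local file system, Pig also offers a local mode:

  pig -x local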
• Step 11) In the Grunt command prompt for Pig, execute the below Pig commands in order, pressing Enter after each command.
  First, load the sales data from HDFS, declaring each CSV field:
  salesTable = LOAD '/SalesJan2009.csv' USING PigStorage(',') AS (Transaction_date:chararray, Product:chararray, Price:chararray, Payment_Type:chararray, Name:chararray, City:chararray, State:chararray, Country:chararray, Account_Created:chararray, Last_Login:chararray, Latitude:chararray, Longitude:chararray);
  Group the data by the field Country:
  GroupByCountry = GROUP salesTable BY Country;
  For each tuple in 'GroupByCountry', generate the resulting string of the form "Name of Country : No. of products sold":
  CountByCountry = FOREACH GroupByCountry GENERATE CONCAT((chararray)$0, CONCAT(':', (chararray)COUNT($1)));
  Store the results of the data flow in the directory 'pig_output_sales' on HDFS, tab-delimited:
  STORE CountByCountry INTO 'pig_output_sales' USING PigStorage('\t');
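Before (or instead of) running STORE, you can inspect the intermediate relations from the Grunt shell; this is an optional debugging aid, not part of the original flow:

  DESCRIBE GroupByCountry;   -- print the schema of a relation
  DUMP CountByCountry;       -- execute the pipeline and print the results to the console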
• Step 12) The result can be seen through the command interface as:
  $HADOOP_HOME/bin/hdfs dfs -cat pig_output_sales/part-r-00000
CONCLUSION
• Pig enables people to focus more on analyzing bulk data sets and to
spend less time writing MapReduce programs.
THANK YOU