PIG
Apache Pig is a platform for analyzing large data sets that consists
of a high-level language for expressing data analysis programs, coupled
with infrastructure for evaluating these programs.
The salient property of Pig programs is that their structure is amenable
to substantial parallelization, which in turn enables them to handle very
large data sets.
At the present time, Pig's infrastructure layer consists of a compiler
that produces sequences of Map-Reduce programs, for which
large-scale parallel implementations already exist (e.g., the Hadoop
subproject).
Pig's language layer currently consists of a textual language called Pig
Latin.
Key properties:
• Ease of programming. It is trivial to achieve parallel
execution of simple, "embarrassingly parallel" data analysis
tasks. Complex tasks comprised of multiple interrelated
data transformations are explicitly encoded as data flow
sequences, making them easy to write, understand, and
maintain.
• Optimization opportunities. The way in which tasks are
encoded permits the system to optimize their execution
automatically, allowing the user to focus on semantics
rather than efficiency.
• Extensibility. Users can create their own functions to do
special-purpose processing.
The Apache Pig framework has the following major components in its
architecture:
• Parser
• Optimizer
• Compiler
• Execution Engine
ARCHITECTURE
• 1. Parser: Any Pig script or command entered in the Grunt shell is
handled by the parser.
• The parser checks the syntax of the script, performs type checking,
and runs various other checks.
• The output of these checks is a Directed Acyclic Graph (DAG) that
represents the Pig Latin statements and logical operators.
• In the DAG, the logical operators of the script are the nodes and the
data flows are the edges.
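The DAG described above can be illustrated with a minimal Python sketch. This is not Pig's internal representation; the four-operator plan and the function name `topological_order` are assumptions made for illustration. It shows operators as nodes, data flows as edges, and one order in which the operators could run.

```python
from collections import defaultdict, deque

# Hypothetical logical plan for a small script: LOAD -> FILTER -> GROUP -> STORE.
# Nodes are logical operators; each edge is a data flow from producer to consumer.
edges = [
    ("LOAD", "FILTER"),
    ("FILTER", "GROUP"),
    ("GROUP", "STORE"),
]


def topological_order(edges):
    """Return the operators in an order that respects every data-flow edge."""
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    nodes = []
    for src, dst in edges:
        for node in (src, dst):
            if node not in nodes:
                nodes.append(node)
        successors[src].append(dst)
        in_degree[dst] += 1

    # Kahn's algorithm: repeatedly emit operators whose inputs are all available.
    ready = deque(n for n in nodes if in_degree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(nodes):
        raise ValueError("plan contains a cycle, so it is not a DAG")
    return order


plan_order = topological_order(edges)
print(plan_order)
```

A plan with a cycle is rejected, which is the "acyclic" part of the definition.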
• 2. Optimizer: Once parsing is complete and the DAG is generated, the
DAG is passed to the logical optimizer, which performs logical
optimizations such as projection and pushdown.
• Projection and pushdown improve query performance by omitting
unnecessary columns or data and by pruning the loader so that it
loads only the necessary columns.
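The effect of pushing a projection into the loader can be sketched in Python. This is a toy model, not Pig's optimizer: the sample rows and the functions `load`, `naive_plan`, and `optimized_plan` are assumptions for illustration. Both plans return the same result, but the optimized one reads only the columns the script uses.

```python
# Hypothetical input rows; in Pig these would come from a loader.
rows = [
    {"name": "ann", "age": 31, "city": "Pune", "salary": 50},
    {"name": "bob", "age": 17, "city": "Delhi", "salary": 20},
    {"name": "cat", "age": 45, "city": "Pune", "salary": 70},
]


def load(rows, columns=None):
    """Toy loader: return only the requested columns (all columns if None)."""
    if columns is None:
        return [dict(r) for r in rows]
    return [{c: r[c] for c in columns} for r in rows]


def naive_plan(rows):
    # Load every column, filter, then project at the very end.
    loaded = load(rows)
    adults = [r for r in loaded if r["age"] >= 18]
    return [{"name": r["name"]} for r in adults]


def optimized_plan(rows):
    # Projection pushed into the loader: only "name" and "age" are read.
    loaded = load(rows, columns=["name", "age"])
    adults = [r for r in loaded if r["age"] >= 18]
    return [{"name": r["name"]} for r in adults]


# The optimization changes how much data is read, not the answer.
assert naive_plan(rows) == optimized_plan(rows)
print(optimized_plan(rows))
```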
• 3. Compiler: The compiler compiles the optimized logical plan
generated above into a series of MapReduce jobs.
• The compiler converts a Pig job into MapReduce jobs automatically
and exploits optimization opportunities in the script, so the
programmer does not have to tune the program manually.
• Because Pig is a data-flow language, its compiler can reorder the
execution sequence to optimize performance, as long as the
reordered plan produces the same result as the original program.
• 4. Execution Engine: Finally, all the MapReduce jobs
generated by the compiler are submitted to Hadoop in sorted
order. The MapReduce jobs are then executed on Hadoop
to produce the desired output.
• 5. Execution Mode: Pig runs in one of two execution modes,
depending on where the script runs and where the data is available:
• Local Mode: Local mode is best suited for small data sets.
• Pig runs in a single JVM, with all files installed and run on
localhost, so parallel mapper execution is not possible.
• When loading data, Pig always looks in the local file
system.
• MapReduce Mode (MR Mode): In MapReduce mode, the
programmer needs access to a configured Hadoop cluster
and an HDFS installation.
• In this mode, the data being processed resides in
HDFS.
• When a Pig script is executed in MR mode, its Pig Latin statements
are converted into MapReduce jobs in the back end to
perform the operations on the data. Pig uses MapReduce mode
by default, so it does not need to be specified with the -x
flag.
GRUNT
• Grunt is Pig's interactive shell.
• After invoking the Grunt shell, you can run your Pig
scripts in the shell.
• HDFS commands in Grunt:
• 1. fs -ls /
• 2. fs -cat /
• 3. fs -mkdir /
• 4. fs -copyFromLocal
Shell commands in Grunt
• Shell commands are invoked with sh; HDFS file system commands are invoked with fs
• sh ls
• sh cat
• clear
• help
• history
• set: assigns a value to a key, for example
• > set job.name 'myjob'
• exec
• kill
• run
• quit
Pig Latin
• Pig is a high-level platform or tool used to process large
datasets.
• It provides a high level of abstraction over MapReduce
processing.
• It provides a high-level scripting language, known as Pig Latin, which is
used to write data analysis code.
• First, to process data stored in HDFS,
programmers write scripts in the Pig Latin language.
• Internally, the Pig Engine (a component of Apache Pig) converts
these scripts into specific map and reduce tasks.
• These tasks are not visible to the programmers, which keeps the
level of abstraction high.
• Pig Latin and the Pig Engine are the two main components of the Apache
Pig tool. The results of Pig are always stored in HDFS.
Pig Latin statements
Pig Latin statements are generally organized as follows:
• A LOAD statement to read data from the file system.
• A series of "transformation" statements to process the
data.
• A DUMP statement to view results or a STORE
statement to save the results.
Note that a DUMP or STORE statement is required to
generate output.
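The LOAD, transform, DUMP/STORE pattern can be mimicked in a short Python sketch. This is an analogy, not Pig's API: the `Relation` class, the `load` function, and the sample data are all hypothetical. It shows that transformation steps are only recorded, and that nothing runs until a dump or store call asks for output.

```python
import csv
import io

# Hypothetical comma-separated input standing in for a file on the file system.
DATA = "ann,31\nbob,17\ncat,45\n"

load_calls = []  # records each time the input is actually read


class Relation:
    """Toy stand-in for a relation: it records steps and runs them on demand."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable returning loaded rows
        self._steps = tuple(steps)   # transformations recorded so far

    def filter(self, predicate):
        step = lambda rows: [r for r in rows if predicate(r)]
        return Relation(self._source, self._steps + (step,))

    def foreach(self, func):
        step = lambda rows: [func(r) for r in rows]
        return Relation(self._source, self._steps + (step,))

    def _execute(self):
        rows = self._source()
        for step in self._steps:
            rows = step(rows)
        return rows

    def dump(self):
        """Run the recorded steps and print the result, like DUMP."""
        rows = self._execute()
        for row in rows:
            print(row)
        return rows

    def store(self, stream):
        """Run the recorded steps and write the result, like STORE."""
        rows = self._execute()
        csv.writer(stream, lineterminator="\n").writerows(rows)
        return len(rows)


def load(text):
    """Return a relation whose rows are read only when output is requested."""
    def source():
        load_calls.append(1)
        return [(name, int(age)) for name, age in csv.reader(io.StringIO(text))]
    return Relation(source)


people = load(DATA)                            # LOAD
adults = people.filter(lambda r: r[1] >= 18)   # transformation
names = adults.foreach(lambda r: (r[0],))      # transformation
assert load_calls == []                        # nothing has been read yet
result = names.dump()                          # DUMP triggers execution
```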
Pig Latin operators
• Arithmetic operators
• Relational operators
- LOAD, STORE, FILTER, DISTINCT, JOIN, GROUP, ORDER, LIMIT, etc.
• Comparison operators
• Type construction operators
() tuple constructor, [] map constructor, {} bag constructor
• Diagnostic operators
1. DUMP
2. DESCRIBE: to verify the schema of a relation
3. EXPLAIN: to verify the logical plan, physical plan, and MapReduce plan of a relation
4. ILLUSTRATE: to review how the data is transformed
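The three type constructors have rough Python analogues, sketched below. This is an analogy only: the sample values and the helper `bags_equal` are assumptions for illustration. A Pig tuple is an ordered set of fields, a map holds key-value pairs, and a bag is a collection of tuples in which order does not matter and duplicates are allowed.

```python
from collections import Counter

# Pig tuple (ann,31): an ordered set of fields.
record = ("ann", 31)

# Pig map [city#Pune]: key-value pairs.
attrs = {"city": "Pune"}

# Pig bag {(ann,31),(cat,45),(ann,31)}: a collection of tuples.
bag = [("ann", 31), ("cat", 45), ("ann", 31)]


def bags_equal(a, b):
    """Compare two bags ignoring order but counting duplicates."""
    return Counter(a) == Counter(b)


print(record[0], attrs["city"], len(bag))
```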
HIVE
• Hive is data warehouse software that provides data query and
analysis.
• Developed by Facebook and built on top of Apache Hadoop.
• Provides support for reading, writing, and managing large
datasets stored in Hadoop HDFS.
• Its query language is HiveQL.
There are three core parts of the Hive architecture:
• Hive Client
• Hive Services
• Hive Storage and Computing
Hive architecture:
• Hive Client
• Hive provides multiple drivers for communication with different
types of applications. Hive supports applications written in
programming languages such as Python, C++, and Java.
• Hive clients fall into three categories:
• Hive Thrift Clients
• Hive JDBC Driver
• Hive ODBC Driver

A slide share pig in CCS334 for big data analytics
