Unit-4
Pig: Hadoop Programming Made Easier: Admiring the Pig Architecture
Java MapReduce programs and the Hadoop
Distributed File System (HDFS) provide you with
a powerful distributed computing framework, but
they come with one major drawback — relying
on them limits the use of Hadoop to Java
programmers who can think in Map and Reduce
terms when writing programs.
Pig is a programming tool attempting to have the best
of both worlds: a declarative query language inspired
by SQL and a low-level procedural programming
language that can generate MapReduce code. This
lowers the bar when it comes to the level of technical
knowledge needed to exploit the power of Hadoop.
Pig was initially developed at Yahoo! in 2006 as part of a
research project tasked with coming up with ways for
people using Hadoop to focus more on analyzing large
data sets rather than spending lots of time writing Java
MapReduce programs. The goal here was a familiar one:
Allow users to focus more on what they want to do and
less on how it's done. Not long after, in 2007, Pig officially
became an Apache project. As such, it is included in most
Hadoop distributions.
The Pig programming language is designed to handle any kind of data
tossed its way: structured, semistructured, or unstructured.
Pigs, of course, have a reputation for eating anything they come
across. According to the Apache Pig philosophy, pigs eat anything,
live anywhere, are domesticated, and can fly to boot. Pigs "living
anywhere" refers to the fact that Pig is a parallel data processing
programming language that is not committed to any one particular
parallel framework, Hadoop included. What makes it a
domesticated animal? Well, if "domesticated" means "plays well
with humans," then it's definitely the case that Pig prides itself on
being easy for humans to code and maintain. Lastly, Pig is smart; in
data-processing lingo, this means there is an optimizer that does the
hard work of figuring out how to get at the data quickly.
Pig is not just going to be quick: it's going to fly.
Admiring the Pig Architecture
Pig is made up of two components:
a) The language itself: The programming language for Pig is known
as Pig Latin, a high-level language that allows you to write data
processing and analysis programs.
b) The Pig Latin compiler: The Pig Latin compiler converts the Pig
Latin code into executable code. The executable code is either in
the form of MapReduce jobs, or it can spawn a process in which a
virtual Hadoop instance is created to run the Pig code on a single
node.
The sequence of MapReduce programs enables Pig programs to do
data processing and analysis in parallel, leveraging Hadoop
MapReduce and HDFS. Running the Pig job in the virtual Hadoop
instance is a useful strategy for testing your Pig scripts. Figure 1
shows how Pig relates to the Hadoop ecosystem.
Pig programs can run on MapReduce 1 or MapReduce 2
without any code changes, regardless of which mode your
cluster is running in. However, Pig scripts can also run using
the Tez API instead. Apache Tez provides a more efficient
execution framework than MapReduce. YARN enables
application frameworks other than MapReduce (like Tez) to
run on Hadoop. Hive can also run against the Tez
framework.
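In practice, the execution engine is usually selected at launch time with the pig command's -x flag, so the same script can run unchanged on either engine. A sketch of the invocations (the script name is hypothetical):

```
pig -x mapreduce myscript.pig   # run on the MapReduce engine
pig -x tez myscript.pig         # run the same script on Tez
pig -x local myscript.pig       # single-node local mode, handy for testing
```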
Going with the Pig Latin Application
Flow
At its core, Pig Latin is a dataflow language, in which you define a
data stream and a series of transformations that are applied to the
data as it flows through your application. This is in contrast to a
control flow language (like C or Java), where you write a series of
instructions using constructs like loops and conditional logic (such
as an if statement). You won't find loops and if statements in Pig
Latin. If you need some convincing that working with Pig is
significantly easier than having to write Map and Reduce programs,
start by taking a look at some real Pig syntax.
The following listing shows sample Pig code
to illustrate the data-processing dataflow.
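The original listing does not survive in this extract. As a stand-in, here is a minimal sketch of the kind of script being described, using hypothetical file and field names:

```
-- Load comma-delimited records from HDFS (hypothetical file and schema)
records = LOAD 'flight_data.csv' USING PigStorage(',')
          AS (year:int, carrier:chararray, distance:int);

-- Keep only the rows of interest
long_hauls = FILTER records BY distance > 1000;

-- Group the data and build an aggregation per group
by_carrier = GROUP long_hauls BY carrier;
counts = FOREACH by_carrier GENERATE group, COUNT(long_hauls);

-- Show the results on screen
DUMP counts;
```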
Some of the text in this example actually looks like English. Looking at each line in
turn, you can see the basic flow of a Pig program. This code can either be part of
a script or issued on the interactive shell called Grunt.
Load: You first load (LOAD) the data you want to manipulate. As in a typical
MapReduce job, that data is stored in HDFS.
For a Pig program to access the data, you first tell Pig what file or files to
use. For that task, you use the LOAD 'data_file' command. Here, 'data_file'
can specify either an HDFS file or a directory. If a directory is specified, all
files in that directory are loaded into the program. If the data is stored in a file
format that isn't natively accessible to Pig, you can optionally add the USING
clause to the LOAD statement to specify a user-defined function that can
read in (and interpret) the data.
Transform: You run the data through a set of transformations which are translated
into a set of Map and Reduce tasks. The transformation logic is where all the data
manipulation happens. You can FILTER out rows that aren't of interest, JOIN two
sets of data files, GROUP data to build aggregations, ORDER results, and do
much, much more.
Dump: Finally, you dump (DUMP) the results to the screen or Store (STORE) the
results in a file somewhere.
Working through the ABCs of Pig Latin
Pig Latin is the language for Pig programs. Pig translates the Pig Latin script
into MapReduce jobs that can be executed within a Hadoop cluster. When
coming up with Pig Latin, the development team followed three key design
principles:
Keep it simple. Pig Latin provides a streamlined method for interacting with
Java MapReduce. It's an abstraction, in other words, that simplifies the
creation of parallel programs on the Hadoop cluster for data flows and
analysis. Complex tasks may require a series of interrelated data
transformations; such series are encoded as data flow sequences. Writing
data transformations and flows as Pig Latin scripts instead of Java MapReduce
programs makes these programs easier to write, understand, and maintain
because a) you don't have to write the job in Java, b) you don't have to think in
terms of MapReduce, and c) you don't need to come up with custom code to
support rich data types. Pig Latin provides a simpler language to exploit your
Hadoop cluster, thus making it easier for more people to leverage the power of
Hadoop.
Make it smart.
You may recall that the Pig Latin compiler does the work of
transforming a Pig Latin program into a series of Java MapReduce
jobs. The trick is to make sure that the compiler can optimize the
execution of these Java MapReduce jobs automatically, allowing the
user to focus on semantics rather than on how to optimize and
access the data. SQL is set up as a declarative query language that
you use to access structured data stored in an RDBMS. The RDBMS
engine first translates the query to a data access method and then
looks at the statistics and generates a series of data access
approaches. The cost-based optimizer chooses the most efficient
approach for execution.
Don’t limit development.
Make Pig extensible so that developers can add functions to address their
particular business problems.
Uncovering Pig Latin structures
Most Pig scripts start with the LOAD statement to read data from HDFS. In this
case, we're loading data from a .csv file. Pig has a data model of its own, so next
we need to map the file's data model to the Pig data model. This is accomplished
with the help of the USING statement: we specify that it is a comma-delimited file
with the PigStorage(',') statement, followed by the AS statement defining the names
of each of the columns. Aggregations are commonly used in Pig to summarize
data sets. The GROUP statement is used to aggregate the records into a single
record, mileage_recs; the ALL keyword is used to aggregate all tuples into a single
group. Note that some statements, including the following SUM statement,
require a preceding GROUP ALL statement for global sums. FOREACH . . .
GENERATE statements are used here to transform column data. In this case,
we want to count the miles traveled in the records_Distance column. The SUM
statement computes the sum of the records_Distance column into a single-column
collection, total_miles.
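Putting the statements described above together, a hedged reconstruction of the script might read as follows (the file name is hypothetical; the alias and column names are the ones mentioned in the text):

```
-- Load the comma-delimited .csv file from HDFS and name the column
records = LOAD 'flight_data.csv' USING PigStorage(',')
          AS (records_Distance:int);

-- Aggregate all tuples into a single group
mileage_recs = GROUP records ALL;

-- Compute the global sum of the distance column
total_miles = FOREACH mileage_recs GENERATE SUM(records.records_Distance);

-- Execute the plan and display the result on screen
DUMP total_miles;
```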
The DUMP operator is used to execute the Pig Latin statements and display the
results on the screen. DUMP is used in interactive mode, which means that the
statements are executed immediately and the results are not saved. Typically, you
will use either the DUMP or STORE operator at the end of your Pig script.

More Related Content

What's hot

Gremlin: A Graph-Based Programming Language
Gremlin: A Graph-Based Programming LanguageGremlin: A Graph-Based Programming Language
Gremlin: A Graph-Based Programming LanguageMarko Rodriguez
 
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | EdurekaWhat Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | EdurekaEdureka!
 
Best Practices in DataOps: How to Create Agile, Automated Data Pipelines
Best Practices in DataOps: How to Create Agile, Automated Data PipelinesBest Practices in DataOps: How to Create Agile, Automated Data Pipelines
Best Practices in DataOps: How to Create Agile, Automated Data PipelinesEric Kavanagh
 
Datamining - On What Kind of Data
Datamining - On What Kind of DataDatamining - On What Kind of Data
Datamining - On What Kind of Datawina wulansari
 
Talend Data Integration Tutorial | Talend Tutorial For Beginners | Talend Onl...
Talend Data Integration Tutorial | Talend Tutorial For Beginners | Talend Onl...Talend Data Integration Tutorial | Talend Tutorial For Beginners | Talend Onl...
Talend Data Integration Tutorial | Talend Tutorial For Beginners | Talend Onl...Edureka!
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Metadata Strategies
Metadata StrategiesMetadata Strategies
Metadata StrategiesDATAVERSITY
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 
Data warehouse con azure synapse analytics
Data warehouse con azure synapse analyticsData warehouse con azure synapse analytics
Data warehouse con azure synapse analyticsEduardo Castro
 
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Edureka!
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Miningidnats
 
Informatica Transformations with Examples | Informatica Tutorial | Informatic...
Informatica Transformations with Examples | Informatica Tutorial | Informatic...Informatica Transformations with Examples | Informatica Tutorial | Informatic...
Informatica Transformations with Examples | Informatica Tutorial | Informatic...Edureka!
 
Schemas for multidimensional databases
Schemas for multidimensional databasesSchemas for multidimensional databases
Schemas for multidimensional databasesyazad dumasia
 
Data Visualization Design Best Practices Workshop
Data Visualization Design Best Practices WorkshopData Visualization Design Best Practices Workshop
Data Visualization Design Best Practices WorkshopJSI
 

What's hot (20)

Gremlin: A Graph-Based Programming Language
Gremlin: A Graph-Based Programming LanguageGremlin: A Graph-Based Programming Language
Gremlin: A Graph-Based Programming Language
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | EdurekaWhat Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
 
Unit 4-apache pig
Unit 4-apache pigUnit 4-apache pig
Unit 4-apache pig
 
Best Practices in DataOps: How to Create Agile, Automated Data Pipelines
Best Practices in DataOps: How to Create Agile, Automated Data PipelinesBest Practices in DataOps: How to Create Agile, Automated Data Pipelines
Best Practices in DataOps: How to Create Agile, Automated Data Pipelines
 
Datamining - On What Kind of Data
Datamining - On What Kind of DataDatamining - On What Kind of Data
Datamining - On What Kind of Data
 
Talend Data Integration Tutorial | Talend Tutorial For Beginners | Talend Onl...
Talend Data Integration Tutorial | Talend Tutorial For Beginners | Talend Onl...Talend Data Integration Tutorial | Talend Tutorial For Beginners | Talend Onl...
Talend Data Integration Tutorial | Talend Tutorial For Beginners | Talend Onl...
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Metadata Strategies
Metadata StrategiesMetadata Strategies
Metadata Strategies
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Data warehouse con azure synapse analytics
Data warehouse con azure synapse analyticsData warehouse con azure synapse analytics
Data warehouse con azure synapse analytics
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
 
Informatica Transformations with Examples | Informatica Tutorial | Informatic...
Informatica Transformations with Examples | Informatica Tutorial | Informatic...Informatica Transformations with Examples | Informatica Tutorial | Informatic...
Informatica Transformations with Examples | Informatica Tutorial | Informatic...
 
Schemas for multidimensional databases
Schemas for multidimensional databasesSchemas for multidimensional databases
Schemas for multidimensional databases
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
Data Visualization Design Best Practices Workshop
Data Visualization Design Best Practices WorkshopData Visualization Design Best Practices Workshop
Data Visualization Design Best Practices Workshop
 
Data analytics
Data analyticsData analytics
Data analytics
 

Similar to Unit 4 lecture2

Similar to Unit 4 lecture2 (20)

Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Unit V.pdf
Unit V.pdfUnit V.pdf
Unit V.pdf
 
unit-4-apache pig-.pdf
unit-4-apache pig-.pdfunit-4-apache pig-.pdf
unit-4-apache pig-.pdf
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
An Introduction to Apache Pig
An Introduction to Apache PigAn Introduction to Apache Pig
An Introduction to Apache Pig
 
Unit-3_BDA.ppt
Unit-3_BDA.pptUnit-3_BDA.ppt
Unit-3_BDA.ppt
 
Hadoop - A Very Short Introduction
Hadoop - A Very Short IntroductionHadoop - A Very Short Introduction
Hadoop - A Very Short Introduction
 
Bigdata ppt
Bigdata pptBigdata ppt
Bigdata ppt
 
Bigdata
BigdataBigdata
Bigdata
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
Hadoop Tutorial for Beginners
Hadoop Tutorial for BeginnersHadoop Tutorial for Beginners
Hadoop Tutorial for Beginners
 
Pig
PigPig
Pig
 
43_Sameer_Kumar_Das2
43_Sameer_Kumar_Das243_Sameer_Kumar_Das2
43_Sameer_Kumar_Das2
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
 
Learning How to Learn Hadoop
Learning How to Learn HadoopLearning How to Learn Hadoop
Learning How to Learn Hadoop
 

More from vishal choudhary (20)

SE-Lecture1.ppt
SE-Lecture1.pptSE-Lecture1.ppt
SE-Lecture1.ppt
 
SE-Testing.ppt
SE-Testing.pptSE-Testing.ppt
SE-Testing.ppt
 
SE-CyclomaticComplexityand Testing.ppt
SE-CyclomaticComplexityand Testing.pptSE-CyclomaticComplexityand Testing.ppt
SE-CyclomaticComplexityand Testing.ppt
 
SE-Lecture-7.pptx
SE-Lecture-7.pptxSE-Lecture-7.pptx
SE-Lecture-7.pptx
 
Se-Lecture-6.ppt
Se-Lecture-6.pptSe-Lecture-6.ppt
Se-Lecture-6.ppt
 
SE-Lecture-5.pptx
SE-Lecture-5.pptxSE-Lecture-5.pptx
SE-Lecture-5.pptx
 
XML.pptx
XML.pptxXML.pptx
XML.pptx
 
SE-Lecture-8.pptx
SE-Lecture-8.pptxSE-Lecture-8.pptx
SE-Lecture-8.pptx
 
SE-coupling and cohesion.ppt
SE-coupling and cohesion.pptSE-coupling and cohesion.ppt
SE-coupling and cohesion.ppt
 
SE-Lecture-2.pptx
SE-Lecture-2.pptxSE-Lecture-2.pptx
SE-Lecture-2.pptx
 
SE-software design.ppt
SE-software design.pptSE-software design.ppt
SE-software design.ppt
 
SE1.ppt
SE1.pptSE1.ppt
SE1.ppt
 
SE-Lecture-4.pptx
SE-Lecture-4.pptxSE-Lecture-4.pptx
SE-Lecture-4.pptx
 
SE-Lecture=3.pptx
SE-Lecture=3.pptxSE-Lecture=3.pptx
SE-Lecture=3.pptx
 
Multimedia-Lecture-Animation.pptx
Multimedia-Lecture-Animation.pptxMultimedia-Lecture-Animation.pptx
Multimedia-Lecture-Animation.pptx
 
MultimediaLecture5.pptx
MultimediaLecture5.pptxMultimediaLecture5.pptx
MultimediaLecture5.pptx
 
Multimedia-Lecture-7.pptx
Multimedia-Lecture-7.pptxMultimedia-Lecture-7.pptx
Multimedia-Lecture-7.pptx
 
MultiMedia-Lecture-4.pptx
MultiMedia-Lecture-4.pptxMultiMedia-Lecture-4.pptx
MultiMedia-Lecture-4.pptx
 
Multimedia-Lecture-6.pptx
Multimedia-Lecture-6.pptxMultimedia-Lecture-6.pptx
Multimedia-Lecture-6.pptx
 
Multimedia-Lecture-3.pptx
Multimedia-Lecture-3.pptxMultimedia-Lecture-3.pptx
Multimedia-Lecture-3.pptx
 

Recently uploaded

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 

Recently uploaded (20)

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 

Unit 4 lecture2

  • 1. Unit-4 Pig: Hadoop Programming Made Easier: Admiring the Pig Architecture
  • 2. Java MapReduce programs and the Hadoop Distributed File System (HDFS) provide you with a powerful distributed computing framework, but they come with one major drawback — relying on them limits the use of Hadoop to Java programmers who can think in Map and Reduce terms when writing programs.
  • 3. Pig is a programming tool attempting to have the best of both worlds: a declarative query language inspired by SQL and a low-level procedural programming language that can generate MapReduce code. This lowers the bar when it comes to the level of technical knowledge needed to exploit the power of Hadoop.
  • 4. Pig was initially developed at Yahoo! in 2006 as part of a research project tasked with coming up with ways for people using Hadoop to focus more on analyzing large data sets rather than spending lots of time writing Java MapReduce programs. The goal here was a familiar one: Allow users to focus more on what they want to do and less on how it‘s done. Not long after, in 2007, Pig officially became an Apache project. As such, it is included in most Hadoop distributions.
  • 5.  The Pig programming language is designed to handle any kind of data tossed its way — structured, semistructured, unstructured data. Pigs, of course, have a reputation for eating anything they come across. According to the Apache Pig philosophy, pigs eat anything, live anywhere, are domesticated and can fly to boot. Pigs ―living anywhere‖ refers to the fact that Pig is a parallel data processing programming language and is not committed to any particular parallel framework — including Hadoop. What makes it a domesticated animal? Well, if ―domesticated‖ means ―plays well with humans,‖ then it‘s definitely the case that Pig prides itself on being easy for humans to code and maintain. Lastly, Pig is smart and in data processing lingo this means there is an optimizer that figures out how to do the hard work of figuring out how to get the data quickly. Pig is not just going to be quick — it‘s going to fly.
  • 6. Admiring the Pig Architecture Pig is made up of two components: a) The language itself: The programming language for Pig is known as Pig Latin, a high-level language that allows you to write data processing and analysis programs. b) b) The Pig Latin compiler: The Pig Latin compiler converts the Pig Latin code into executable code. The executable code is either in the form of MapReduce jobs or it can spawn a process where a virtual Hadoop instance is created to run the Pig code on a single node. The sequence of MapReduce programs enables Pig programs to do data processing and analysis in parallel, leveraging Hadoop MapReduce and HDFS. Running the Pig job in the virtual Hadoop instance is a useful strategy for testing your Pig scripts. Figure 1 shows how Pig relates to the Hadoop ecosystem.
  • 7. Pig programs can run on MapReduce 1 or MapReduce 2 without any code changes, regardless of what mode your cluster is running. However, Pig scripts can also run using the Tez API instead. Apache Tez provides a more efficient execution framework than MapReduce. YARN enables application frameworks other than MapReduce (like Tez) to run on Hadoop. Hive can also run against the Tez framework.
  • 8. Going with the Pig Latin Application Flow  At its core, Pig Latin is a dataflow language, where you define a data stream and a series of transformations that are applied to the data as it flows through your application. This is in contrast to a control flow language (like C or Java), where you write a series of instructions. In control flow languages, we use constructs like loops and conditional logic (like an if statement). You won‘t find loops and if statements in Pig Latin. If you need some convincing that working with Pig is a significantly easier than having to write Map and Reduce programs, start by taking a look at some real Pig syntax
  • 9. The following listing shows sample Pig code that illustrates the data processing dataflow. Some of the text in this example actually looks like English. Looking at each line in turn, you can see the basic flow of a Pig program. This code can either be part of a script or issued on the interactive shell called Grunt. Load: You first load (LOAD) the data you want to manipulate. As in a typical MapReduce job, that data is stored in HDFS.
  • 10.  For a Pig program to access the data, you first tell Pig what file or files to use. For that task, you use the LOAD 'data_file' command. Here, 'data_file' can specify either an HDFS file or a directory. If a directory is specified, all files in that directory are loaded into the program. If the data is stored in a file format that isn't natively accessible to Pig, you can optionally add the USING function to the LOAD statement to specify a user-defined function that can read in (and interpret) the data. Transform: You run the data through a set of transformations, which are translated into a set of Map and Reduce tasks. The transformation logic is where all the data manipulation happens. You can FILTER out rows that aren't of interest, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and do much, much more. Dump: Finally, you dump (DUMP) the results to the screen or store (STORE) the results in a file somewhere.
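The Load-Transform-Dump pattern above can be sketched as a short script. The paths and field names here are hypothetical, made up to illustrate the three phases:

```pig
-- Load: read every file under an HDFS directory, tab-delimited
logs = LOAD '/data/logs' USING PigStorage('\t')
       AS (ts:long, url:chararray, bytes:int);

-- Transform: filter out small requests, then order by size
big = FILTER logs BY bytes > 1024;
srt = ORDER big BY bytes DESC;

-- Dump to the screen for interactive work, or STORE to persist in HDFS
STORE srt INTO '/data/output/big_requests';
```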
  • 11. Working through the ABCs of Pig Latin  Pig Latin is the language for Pig programs. Pig translates the Pig Latin script into MapReduce jobs that can be executed within a Hadoop cluster. When coming up with Pig Latin, the development team followed three key design principles: Keep it simple. Pig Latin provides a streamlined method for interacting with Java MapReduce. It's an abstraction, in other words, that simplifies the creation of parallel programs on the Hadoop cluster for data flows and analysis. Complex tasks may require a series of interrelated data transformations; such series are encoded as data flow sequences. Writing data transformations and flows as Pig Latin scripts instead of Java MapReduce programs makes these programs easier to write, understand, and maintain because a) you don't have to write the job in Java, b) you don't have to think in terms of MapReduce, and c) you don't need to come up with custom code to support rich data types. Pig Latin provides a simpler language to exploit your Hadoop cluster, thus making it easier for more people to leverage the power of Hadoop.
  • 12.  Make it smart. You may recall that the Pig Latin compiler does the work of transforming a Pig Latin program into a series of Java MapReduce jobs. The trick is to make sure that the compiler can optimize the execution of these Java MapReduce jobs automatically, allowing the user to focus on semantics rather than on how to optimize and access the data. SQL is set up as a declarative query language that you use to access structured data stored in an RDBMS. The RDBMS engine first translates the query to a data access method, then looks at the statistics and generates a series of data access approaches. The cost-based optimizer chooses the most efficient approach for execution. Don't limit development. Make Pig extensible so that developers can add functions to address their particular business problems.
  • 13. Uncovering Pig Latin structures Most Pig scripts start with the LOAD statement to read data from HDFS. In this case, we're loading data from a .csv file. Pig has a data model it uses, so next we need to map the file's data model to the Pig data model. This is accomplished with the help of the USING statement. We then specify that it is a comma-delimited file with the PigStorage(',') statement, followed by the AS statement defining the name of each of the columns. Aggregations are commonly used in Pig to summarize data sets. The GROUP statement is used to aggregate the records into a single record, mileage_recs. The ALL statement is used to aggregate all tuples into a single group. Note that some statements, including the following SUM statement, require a preceding GROUP ALL statement for global sums. FOREACH . . . GENERATE statements are used here to transform columns of data. In this case, we want to count the miles traveled in the records_Distance column. The SUM statement computes the sum of the records_Distance column into a single-column collection, total_miles. The DUMP operator is used to execute the Pig Latin statements and display the results on the screen. DUMP is used in interactive mode, which means that the statements are executed immediately and the results are not saved. Typically, you will use either the DUMP or STORE operator at the end of your Pig script.
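The listing this walkthrough refers to does not survive in the slide text. A minimal sketch consistent with the description follows; the file name flight_data.csv and the exact schema are assumptions, while the relation names mileage_recs and total_miles and the distance column are taken from the text above (with records_Distance read as the Distance field of the records relation):

```pig
-- LOAD a comma-delimited file, mapping it onto Pig's data model with AS
records = LOAD 'flight_data.csv' USING PigStorage(',')
          AS (Year:int, Month:int, Distance:int);

-- GROUP ... ALL gathers every tuple into one group, as SUM requires
mileage_recs = GROUP records ALL;

-- FOREACH ... GENERATE transforms the grouped data into a total
total_miles = FOREACH mileage_recs GENERATE SUM(records.Distance);

-- DUMP executes the statements and prints the result to the screen
DUMP total_miles;
```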