Apache Pig is a high-level data flow platform for executing MapReduce programs on Hadoop. The language used for Pig is called Pig Latin. Pig scripts get converted into MapReduce jobs that are executed on data stored in HDFS. Pig can handle structured, semi-structured, or unstructured data and store results back in HDFS. Common Pig operations include joining, sorting, filtering, grouping, and using built-in and user-defined functions.
Apache Pig: introduction, description, installation, Pig Latin commands, use, examples, and usefulness are demonstrated in this presentation.
Tushar B. Kute, Researcher
http://tusharkute.com
2. CONTENTS:
• PIG background
• PIG Architecture
• PIG Latin Basics
• PIG Execution Modes
• PIG Processing: loading and transforming data
• PIG Built-in functions
• Filtering, grouping, sorting data
• Installation of PIG and PIG Latin commands
3. Apache Pig was originally developed at Yahoo! Research around 2006 so that researchers would have an ad hoc way of creating and executing MapReduce jobs on very large data sets. In 2007, it was moved into the Apache Software Foundation.
4. The story goes that the researchers working on the project initially referred to it simply as 'the language'. Eventually they needed to call it something. Off the top of his head, one researcher suggested Pig, and the name stuck. It is quirky yet memorable and easy to spell. While some have hinted that the name sounds coy or silly, it has provided us with an entertaining nomenclature, such as Pig Latin for the language, Grunt for the shell, and PiggyBank for the CPAN-like shared repository.
5. This Pig tutorial provides basic and advanced concepts of Pig, and is designed for beginners and professionals alike. Pig is a high-level data flow platform for executing MapReduce programs on Hadoop. It was developed by Yahoo!, and its language is Pig Latin. Pig provides an engine for executing data flows in parallel on Hadoop, and it includes a language, Pig Latin, for expressing those data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig is an Apache open source project: users are free to download it as source or binary, use it for themselves, contribute to it, and, under the terms of the Apache License, use it in their products and change it as they see fit.
6. Apache Pig is a high-level data flow platform for executing MapReduce programs on Hadoop. The language used for Pig is Pig Latin. Pig scripts are internally converted into MapReduce jobs and executed on data stored in HDFS. Apart from that, Pig can also execute its jobs on Apache Tez or Apache Spark. Pig can handle any type of data, i.e., structured, semi-structured, or unstructured, and stores the corresponding results in the Hadoop Distributed File System. Every task that can be achieved using Pig can also be achieved using Java MapReduce code.
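As a taste of what such a script looks like, here is a minimal sketch of a Pig Latin job (the file name employees.txt and its fields are hypothetical):
-- load tab-delimited records from HDFS
emps = LOAD 'employees.txt' USING PigStorage('\t') AS (id:int, name:chararray, salary:float);
-- keep only the well-paid employees
well_paid = FILTER emps BY salary > 50000.0F;
-- write the result back to HDFS; Pig compiles these statements into MapReduce jobs
STORE well_paid INTO 'well_paid_out';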
7. FEATURES OF APACHE PIG
Join Datasets
Sort Datasets
Filter
Data Types
Group By
User Defined Functions
8. ADVANTAGES OF APACHE PIG
• Less code - Pig needs fewer lines of code to perform any given operation.
• Reusability - Pig code is flexible enough to be reused.
• Nested data types - Pig provides useful nested data types such as tuple, bag, and map.
11. HIVE VS PIG VS SQL – WHEN TO USE WHAT?
When to Use Hive
Facebook widely uses Apache Hive for analytical purposes, and usually promotes the Hive language due to its extensive feature list and its similarities with SQL. Here are some of the scenarios where Apache Hive is ideal to use:
• To query large datasets: Apache Hive is used especially for analytics on huge datasets. It is an easy way to approach and quickly carry out complex querying on datasets, and to inspect the datasets stored in the Hadoop ecosystem.
• For extensibility: Apache Hive contains a range of user APIs that help in building custom behaviour for the query engine.
• For someone familiar with SQL concepts: If you are familiar with SQL, Hive will be very easy to use, as you will see many similarities between the two. Hive uses clauses like SELECT, WHERE, ORDER BY, and GROUP BY, similar to SQL.
• To work on structured data: For structured data, Hive is widely adopted everywhere.
• To analyse historical data: Apache Hive is a great tool for analysing and querying data that is historical and collected over a period of time.
12. When to Use Pig
Apache Pig, developed at Yahoo! Research in 2006, is famous for its extensibility and scope for optimization. The language uses a multi-query approach that reduces the time spent scanning data. It usually runs on the client side of Hadoop clusters, and it is quite easy to pick up if you are familiar with the SQL ecosystem. You can use Apache Pig in the following scenarios:
• To use as an ETL tool: Apache Pig is an excellent ETL (Extract-Transform-Load) tool for big data. It is a data flow system that uses Pig Latin, a simple language for data queries and manipulation.
• As a programmer with scripting knowledge: Programmers with scripting knowledge can learn how to use Apache Pig easily and efficiently.
• For fast processing: Apache Pig is faster than Hive because it uses a multi-query approach. Apache Pig is famous worldwide for its speed.
• When you don’t want to work with a schema: With Apache Pig, there is no need to create a schema for data-loading work.
• For SQL-like functions: It has many functions related to SQL, along with the COGROUP function.
13. When to Use SQL
SQL is a general-purpose database management language used around the globe, and it has been updated to meet user expectations for decades. It is declarative, and hence focuses explicitly on ‘what’ is needed. It is popularly used for transactional as well as analytical queries. When the requirements are not too demanding, SQL works as an excellent tool. Here are a few scenarios:
• For better performance: SQL is famous for its ability to pull data quickly and frequently. It supports OLAP (Online Analytical Processing) applications and performs well for them, whereas Hive is slow for online transactional needs.
• When the datasets are small: SQL works well with small datasets and performs much better for smaller amounts of data. It also has many ways of optimising data.
• For frequent data manipulation: If your requirements call for frequent modification of records, or you need to update a large number of records frequently, SQL performs these activities well. SQL also provides an entirely interactive experience to the user.
15. APACHE PIG RUN MODES
Apache Pig executes in two modes: Local Mode and MapReduce Mode.
Local Mode
• It executes in a single JVM and is used for development, experimenting, and prototyping.
• Here, files are installed and run using localhost.
• Local mode works on the local file system; the input and output data are stored in the local file system.
The command for the local-mode Grunt shell:
$ pig -x local
MapReduce Mode
• MapReduce mode is also known as Hadoop mode.
• It is the default mode.
• In this mode, Pig renders Pig Latin into MapReduce jobs and executes them on the cluster.
• It can be executed against a semi-distributed or fully distributed Hadoop installation.
• Here, the input and output data are present on HDFS.
The command for the MapReduce-mode Grunt shell:
$ pig -x mapreduce
16. WAYS TO EXECUTE PIG PROGRAM
These are the ways of executing a Pig program in local and MapReduce mode:
• Interactive mode - In this mode, Pig is executed in the Grunt shell. To invoke the Grunt shell, run the pig command. Once the Grunt shell starts, Pig Latin statements and commands can be entered interactively at the command line.
• Batch mode - In this mode, we run a script file having a .pig extension. These files contain Pig Latin commands (see the sketch after this list).
• Embedded mode - In this mode, we can define our own functions, called UDFs (User Defined Functions), using programming languages like Java and Python.
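A sketch of the first two invocations (myscript.pig is a hypothetical script file):
$ pig -x local
(interactive: opens the Grunt shell, here in local mode)
$ pig myscript.pig
(batch: runs the script in the default MapReduce mode)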
18. The language used to analyse data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data. To perform a particular task, programmers using Pig need to write a Pig script in the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded). After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output. Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus it makes the programmer’s job easy. The architecture of Apache Pig is shown below.
19. APACHE PIG COMPONENTS
As shown in the figure, there are various components in the Apache Pig framework. Let us take a look at the major components.
Parser
Initially, Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph) which represents the Pig Latin statements and logical operators. In the DAG, the logical operators of the script are represented as nodes and the data flows as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.
20. Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed, producing the desired results.
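You can watch these stages at work from the Grunt shell: the EXPLAIN operator prints the logical, physical, and MapReduce plans that the parser, optimizer, and compiler produce for a relation. A small sketch (employees.txt as in the earlier hypothetical example):
grunt> emps = LOAD 'employees.txt' AS (id, name, salary);
grunt> EXPLAIN emps;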
21. PIG LATIN DATA MODEL
The data model of Pig Latin is fully nested, and it allows complex non-atomic datatypes such as map and tuple. Given below is the diagrammatic representation of Pig Latin’s data model.
22. An atomic value is one that is indivisible within the context of a database field definition (e.g. integer, real, a code of some sort, etc.). Field values that are not atomic are of two undesirable types (Elmasri & Navathe 1989, pp. 139-41): composite and multivalued.
23. Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.
Example − (Raja, 30)
24. Bag
A bag is an unordered set of tuples; in other words, a collection of (non-unique) tuples is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in an RDBMS, but unlike an RDBMS table, it is not necessary that every tuple contain the same number of fields, or that the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as an inner bag.
Example − {Raja, 30, {9848022338, raja@gmail.com,}}
25. Map
A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique; the value can be of any type. A map is represented by ‘[]’.
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
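As a sketch of maps in use (users.txt and its fields are hypothetical), values are pulled out of a map with the # operator:
users = LOAD 'users.txt' AS (id:int, props:map[]);
-- look up the value stored under the key 'name' in each tuple's map
names = FOREACH users GENERATE id, props#'name';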
26. The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. A Pig script can be executed from the Grunt shell, which is the native shell provided by Apache Pig to execute Pig queries. We can also invoke shell and file system commands from within it using sh and fs.
27. JOB EXECUTION FLOW
The developer creates the scripts, which go to the local file system as functions. When the developer submits a Pig script, it contacts the Pig Latin compiler. The compiler then splits the task and runs a series of MR jobs. Meanwhile, the Pig compiler retrieves data from HDFS, and the output file goes back to HDFS after the MR jobs run.
28. a. Pig Execution Modes
We can run Pig in two execution modes. These modes depend upon where the Pig script is going to run and where the data resides; the data may be stored on a single machine or in a distributed environment like a cluster. There are three different ways to run Pig programs:
• Non-interactive shell or script mode - the user creates a script file, loads the code, and executes the script.
• Grunt shell or interactive shell - for running Apache Pig commands interactively.
• Embedded mode - where Pig is embedded in a host language such as Java, much as JDBC is used to run SQL programs from Java.
29. b. Pig Local Mode
In this mode, Pig runs in a single JVM and accesses the local file system. This mode is better for dealing with small data sets, and parallel mapper execution is not possible (older versions of Hadoop are not thread-safe). The user can pass -x local to get into Pig's local mode of execution, in which Pig always looks for local file system paths when loading data.
30. c. Pig MapReduce Mode
In this mode, the user has a proper Hadoop cluster set up with the necessary installations on it. By default, Apache Pig runs in MR mode: Pig translates the queries into MapReduce jobs and runs them on top of the Hadoop cluster, so this mode runs MapReduce on a distributed cluster. Statements like LOAD and STORE read data from the HDFS file system and show output; they are also used to process data.
31. d. Storing Results
Intermediate data is generated during the processing of MR jobs. Pig stores this data in a non-permanent location on HDFS: a temporary location created inside HDFS for storing this intermediate data. We can use DUMP to print the final results to the output screen, and the output results are stored using the STORE operator.
32. Type Description
int 4-byte integer
long 8-byte integer
float 4-byte (single precision) floating point
double 8-byte (double precision) floating point
bytearray Array of bytes; blob
chararray String (“hello world”)
boolean True/False (case insensitive)
datetime A date and time
biginteger Java BigInteger
bigdecimal Java BigDecimal
33. Type Description
Tuple Ordered set of fields (a “row / record”)
Bag Collection of tuples (a “resultset / table”)
Map A set of key-value pairs; keys must be of type chararray
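All of these types can be declared directly in a LOAD schema. A hypothetical sketch mixing scalar and complex types (students.txt and its fields are assumptions):
data = LOAD 'students.txt'
AS (name:chararray, age:int, gpa:double,
courses:bag{c:tuple(title:chararray)},
contact:map[]);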
34. BinStorage
Loads and stores data in machine-readable (binary) format
PigStorage
Loads and stores data as structured, field delimited text files
TextLoader
Loads unstructured data in UTF-8 format
PigDump
Stores data in UTF-8 format
YourOwnFormat!
via UDFs
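For instance (sales.csv and notes.txt are hypothetical files), a sketch of loading with an explicit delimiter and with TextLoader, then storing with a different delimiter:
sales = LOAD 'sales.csv' USING PigStorage(',') AS (id:int, amount:float);
notes = LOAD 'notes.txt' USING TextLoader() AS (line:chararray);
STORE sales INTO 'sales_out' USING PigStorage('|');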
35. Loads data from an HDFS file:
var = LOAD 'employees.txt';
var = LOAD 'employees.txt' AS (id, name, salary);
var = LOAD 'employees.txt' USING PigStorage() AS (id, name, salary);
Each LOAD statement defines a new bag. Each bag can have multiple elements (atoms), and each element can be referenced by name or position ($n). A bag is immutable, but it can be aliased and referenced later.
36. STORE
Writes output to an HDFS file in a specified directory:
grunt> STORE processed INTO 'processed_txt';
Fails if the directory exists. Writes output files, part-[m|r]-xxxxx, to the directory. PigStorage can be used to specify a field delimiter.
DUMP
Writes output to the screen:
grunt> DUMP processed;
37. FOREACH
Applies expressions to every record in a bag
FILTER
Filters by expression
GROUP
Collect records with the same key
ORDER BY
Sorting
DISTINCT
Removes duplicates
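The following slides illustrate most of these; for FOREACH, a minimal sketch (using a relation shaped like the alias1 examples below):
-- project two columns and compute a third value per record
alias2 = FOREACH alias1 GENERATE col1, col2 + col3;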
38. Use the FILTER operator to restrict tuples or rows of data.
Basic syntax:
alias2 = FILTER alias1 BY expression;
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = FILTER alias1 BY (col1 == 8) OR (NOT (col2 + col3 > col1));
DUMP alias2;
(4,2,1) (8,3,4) (7,2,5) (8,4,3)
39. Use the GROUP…ALL operator to group data. Use GROUP when only one relation is involved; use COGROUP when multiple relations are involved.
Basic syntax:
alias2 = GROUP alias1 ALL;
Example:
DUMP alias1;
(John,18,4.0F) (Mary,19,3.8F) (Bill,20,3.9F) (Joe,18,3.8F)
alias2 = GROUP alias1 BY col2;
DUMP alias2;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
40. Use the ORDER…BY operator to sort a relation based on one or more fields.
Basic syntax:
alias = ORDER alias BY field_alias [ASC|DESC];
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = ORDER alias1 BY col3 DESC;
DUMP alias2;
(7,2,5) (8,3,4) (1,2,3) (4,3,3) (8,4,3) (4,2,1)
41. Use the DISTINCT operator to remove duplicate tuples in a relation.
Basic syntax:
alias2 = DISTINCT alias1;
Example:
DUMP alias1;
(8,3,4) (1,2,3) (4,3,3) (4,3,3) (1,2,3)
alias2 = DISTINCT alias1;
DUMP alias2;
(8,3,4) (1,2,3) (4,3,3)
42. FLATTEN
Used to un-nest tuples as well as bags
INNER JOIN
Used to perform an inner join of two or more relations based on common field values
OUTER JOIN
Used to perform left, right, or full outer joins
SPLIT
Used to partition the contents of a relation into two or more relations
SAMPLE
Used to select a random data sample with the stated sample size
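SPLIT and SAMPLE are not illustrated on the surrounding slides; a small sketch (reusing the alias1 shape from the earlier examples):
SPLIT alias1 INTO small IF col1 < 5, big IF col1 >= 5;
some = SAMPLE alias1 0.1; -- keeps roughly 10% of the tuples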
43. Use the JOIN operator to perform an inner, equijoin of two or more relations based on common field values. The JOIN operator always performs an inner join. Inner joins ignore null keys, so filter null keys before the join. The JOIN and COGROUP operators perform similar functions: JOIN creates a flat set of output records, while COGROUP creates a nested set of output records.
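A minimal sketch of that advice (a and b are hypothetical relations keyed on id):
a_clean = FILTER a BY id IS NOT NULL;
b_clean = FILTER b BY id IS NOT NULL;
joined = JOIN a_clean BY id, b_clean BY id;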
45. Use the OUTER JOIN operator to perform left, right, or full outer joins. Pig Latin syntax closely adheres to the SQL standard. The keyword OUTER is optional; the keywords LEFT, RIGHT, and FULL imply left outer, right outer, and full outer joins respectively. Outer joins will only work provided the relations which need to produce nulls (in the case of non-matching keys) have schemas. Outer joins will only work for two-way joins; to perform a multi-way outer join, perform multiple two-way outer join statements.
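A sketch of a left outer join (a.txt and b.txt are hypothetical files; note that both relations carry schemas):
a = LOAD 'a.txt' AS (id:int, name:chararray);
b = LOAD 'b.txt' AS (id:int, dept:chararray);
-- keeps every tuple of a; missing b fields become null
c = JOIN a BY id LEFT OUTER, b BY id;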
46. UDFs are natively written in Java and packaged as a jar file; other languages include Jython, JavaScript, Ruby, Groovy, and Python. Register the jar with the REGISTER statement and, optionally, alias it with the DEFINE statement.
REGISTER /src/myfunc.jar;
A = LOAD 'students';
B = FOREACH A GENERATE myfunc.MyEvalFunc($0);
47. DEFINE can be used to work with UDFs and also streaming commands; it is useful when dealing with complex input/output formats.
/* read and write comma-delimited data */
DEFINE Y 'stream.pl' INPUT(stdin USING PigStreaming(',')) OUTPUT(stdout USING PigStreaming(','));
A = STREAM X THROUGH Y;
/* define UDFs with a more readable name */
DEFINE MAXNUM org.apache.pig.piggybank.evaluation.math.MAX;
A = LOAD 'student_data' AS (name:chararray, gpa1:float, gpa2:double);
B = FOREACH A GENERATE name, MAXNUM(gpa1, gpa2);
DUMP B;