PIG
Apache Pig is a platform for analyzing large data sets that consists
of a high-level language for expressing data analysis programs, coupled
with infrastructure for evaluating these programs.
The salient property of Pig programs is that their structure is amenable
to substantial parallelization, which in turn enables them to handle very
large data sets.
At the present time, Pig's infrastructure layer consists of a compiler
that produces sequences of Map-Reduce programs, for which
large-scale parallel implementations already exist (e.g., the Hadoop
subproject).
Pig's language layer currently consists of a textual language called Pig
Latin.
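For example, a short Pig Latin script might look like the hedged sketch below (the input path and field names are hypothetical); Pig compiles such a script into one or more Map-Reduce jobs behind the scenes.

-- Minimal Pig Latin sketch; the path and schema are made up for illustration.
logs = LOAD 'input/access_log.txt' USING PigStorage('\t')
       AS (user:chararray, url:chararray, bytes:int);
big  = FILTER logs BY bytes > 1024;   -- keep only the larger responses
DUMP big;                             -- DUMP (or STORE) triggers execution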
Key properties:
• Ease of programming. It is trivial to achieve parallel
execution of simple, "embarrassingly parallel" data analysis
tasks. Complex tasks composed of multiple interrelated
data transformations are explicitly encoded as data flow
sequences, making them easy to write, understand, and
maintain.
• Optimization opportunities. The way in which tasks are
encoded permits the system to optimize their execution
automatically, allowing the user to focus on semantics
rather than efficiency.
• Extensibility. Users can create their own functions to do
special-purpose processing (a brief sketch follows this list).
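As a sketch of extensibility (the jar name and the Java UDF class below are hypothetical), a user-defined function can be registered and then called from Pig Latin:

REGISTER 'myudfs.jar';                      -- hypothetical jar containing the UDF
DEFINE ToUpper com.example.pig.ToUpper();   -- hypothetical Java UDF class
names   = LOAD 'input/names.txt' AS (name:chararray);
shouted = FOREACH names GENERATE ToUpper(name);
DUMP shouted;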
The Apache Pig framework has the following major components as part of its
architecture:
• Parser
• Optimizer
• Compiler
• Execution Engine
ARCHITECTURE
• 1. Parser: Any Pig scripts or commands entered in the Grunt shell are
first handled by the parser.
The parser checks the script's syntax, performs type checking, and
carries out various other validations.
The output of this stage is a Directed Acyclic Graph (DAG) that
represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are the nodes, and the
data flows between them are the edges.
• 2. Optimizer: As soon as parsing is completed and the DAG is
generated, it is passed to the logical optimizer, which performs
logical optimizations such as projection and pushdown.
• Projection and pushdown improve query performance by omitting
unnecessary columns or data and pruning the loader so that it loads
only the required columns (see the sketch after this item).
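In the hedged sketch below (paths and field names are hypothetical), only two of the four loaded columns are referenced later, so the optimizer can push the projection down into the loader and avoid materializing the unused columns:

logs  = LOAD 'input/access_log.txt'
        AS (user:chararray, url:chararray, referrer:chararray, bytes:int);
slim  = FOREACH logs GENERATE user, bytes;   -- projection: only two columns survive
heavy = FILTER slim BY bytes > 1048576;
STORE heavy INTO 'output/heavy_users';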
• 3. Compiler: The optimized logical plan is then compiled by the
compiler, which generates a series of MapReduce jobs.
• The compiler converts the Pig job into MapReduce jobs automatically
and exploits optimization opportunities in the script, so the
programmer does not have to tune the program manually.
• Because Pig is a data-flow language, its compiler can reorder the
execution sequence to improve performance, as long as the reordered
plan produces the same result as the original program (the EXPLAIN
sketch below shows the plans the compiler produces).
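As a sketch, the plans the compiler produces can be inspected from the Grunt shell with the EXPLAIN operator (the relation names are hypothetical):

grunt> heavy = FILTER logs BY bytes > 1048576;
grunt> EXPLAIN heavy;   -- prints the logical, physical, and MapReduce plans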
• 4. Execution Engine: Finally, all the MapReduce jobs generated by
the compiler are submitted to Hadoop in sorted order. The MapReduce
jobs are then executed on Hadoop to produce the desired output.
• 5. Execution Modes: Pig works in two execution modes, depending on
where the script runs and where the data resides (an invocation
sketch follows the mode descriptions below):
• Local Mode: Local mode is best suited for small data sets.
• Pig runs in a single JVM, and all files are installed and run on the
local host, so parallel mapper execution is not possible.
• While loading data, Pig always looks in the local file system.
• MapReduce Mode (MR Mode): In MapReduce mode, the programmer needs
access to a Hadoop cluster with an HDFS installation.
• In this mode, the data being processed resides in HDFS.
• When a Pig script is executed in MR mode, the Pig Latin statements
are converted into MapReduce jobs in the back end to perform the
operations on the data. MapReduce mode is the default, so it does not
need to be specified with the -x flag.
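As a sketch (the script name is hypothetical), the execution mode is chosen with the -x flag when Pig is started:

$ pig -x local myscript.pig        # local mode: single JVM, local file system
$ pig -x mapreduce myscript.pig    # MapReduce mode: runs on the Hadoop cluster
$ pig myscript.pig                 # no flag: defaults to MapReduce mode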
GRUNT
• Grunt is a Pig interactive shell.
• After invoking the Grunt shell, you can run your Pig
scripts in the shell.
• Commands: HDFS commands in the Pig Grunt shell
• 1. fs -ls /
• 2. fs -cat <path>
• 3. fs -mkdir <path>
• 4. fs -copyFromLocal <local path> <HDFS path>
Shell commands in the Pig Grunt shell
• Any shell command can be invoked with sh, and HDFS commands with fs
(a short session sketch follows this list).
• sh ls
• sh cat <file>
• clear
• help
• history
• set - assigns a value to a key, for example:
• > set job.name 'myjob'
• exec - runs a Pig script from the shell
• kill - kills a MapReduce job by its job id
• run - runs a Pig script in the current Grunt context
• quit
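A short hedged example of a Grunt session using a few of these commands (the paths and the script name are hypothetical):

grunt> fs -mkdir /user/demo/input
grunt> fs -copyFromLocal data.txt /user/demo/input
grunt> fs -ls /user/demo/input
grunt> sh ls
grunt> set job.name 'myjob'
grunt> exec wordcount.pig
grunt> quit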
PIG LATIN
• Pig is a high-level platform or tool used to process large datasets.
• It provides a high level of abstraction for processing on top of
MapReduce.
• It provides a high-level scripting language, known as Pig Latin, which
is used to develop data analysis code.
• To process data stored in HDFS, programmers write scripts using the
Pig Latin language.
• Internally, the Pig Engine (a component of Apache Pig) converts all of
these scripts into specific map and reduce tasks.
• These tasks are not visible to the programmer, which preserves the
high level of abstraction.
• Pig Latin and Pig Engine are the two main components of the Apache
Pig tool. The output of Pig is stored in HDFS.
Pig Latin statements
Pig Latin statements are generally organized as follows:
• A LOAD statement to read data from the file system.
• A series of "transformation" statements to process the
data.
• A DUMP statement to view results or a STORE
statement to save the results.
Note that a DUMP or STORE statement is required to
generate output.
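A minimal sketch of this organization (the input path, schema, and output path are hypothetical):

-- LOAD: read data from the file system
sales   = LOAD 'input/sales.csv' USING PigStorage(',')
          AS (store:chararray, item:chararray, amount:double);
-- transformation statements: group and aggregate
byStore = GROUP sales BY store;
totals  = FOREACH byStore GENERATE group AS store, SUM(sales.amount) AS total;
-- STORE (or DUMP) is required to actually generate output
STORE totals INTO 'output/store_totals';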
Pig Latin operators
• Arithmetic operators
• Relational operators
- LOAD, STORE, FILTER, DISTINCT, JOIN, GROUP, ORDER, LIMIT, etc.
• Comparison operators
• Type construction operators
- ( ) tuple constructor, [ ] map constructor, { } bag constructor
• Diagnostic operators
1. DUMP - to print the contents of a relation
2. DESCRIBE - to verify the schema of a relation
3. EXPLAIN - to view the logical, physical, and MapReduce plans of a relation
4. ILLUSTRATE - to review how the data is transformed step by step
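A brief hedged sketch of the type constructors and two of the diagnostic operators, continuing the hypothetical sales example above:

keyed = FOREACH sales GENERATE (store, item) AS pair,     -- ( ) builds a tuple
                               ['name'#store] AS tags,    -- [ ] builds a map
                               {(item)} AS items;         -- { } builds a bag
DESCRIBE keyed;      -- verify the schema of the relation
ILLUSTRATE keyed;    -- review how sample data is transformed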
HIVE
• Hive is data warehouse software for providing data query and analysis.
• Developed by Facebook and built on top of Apache Hadoop.
• Provides support for reading, writing, and managing large datasets
stored in Hadoop HDFS.
• Queries are expressed in HiveQL, Hive's SQL-like query language.
There are three core parts of the Hive architecture:
• Hive Client
• Hive Services
• Hive Storage and Computing
Hive architecture:
• Hive Client
• Hive provides multiple drivers for communication with different types
of applications. Hive supports applications written in programming
languages such as Python, C++, Java, etc.
• Hive clients are categorized into three types:
• Hive Thrift Clients
• Hive JDBC Driver
• Hive ODBC Driver
