PIG
Apache Pig is a platform for analyzing large data sets that consists
of a high-level language for expressing data analysis programs, coupled
with infrastructure for evaluating these programs.
The salient property of Pig programs is that their structure is amenable
to substantial parallelization, which in turn enables them to handle very
large data sets.
At the present time, Pig's infrastructure layer consists of a compiler
that produces sequences of Map-Reduce programs, for which
large-scale parallel implementations already exist (e.g., the Hadoop
subproject).
Pig's language layer currently consists of a textual language called Pig
Latin.
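For example, a short Pig Latin script might look like the hedged sketch below (the input path and field names are hypothetical); Pig compiles such a script into one or more Map-Reduce jobs behind the scenes.

-- Minimal Pig Latin sketch; the path and schema are made up for illustration.
logs = LOAD 'input/access_log.txt' USING PigStorage('\t')
       AS (user:chararray, url:chararray, bytes:int);
big  = FILTER logs BY bytes > 1024;   -- keep only the larger responses
DUMP big;                             -- DUMP (or STORE) triggers execution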
Key properties:
• Ease of programming. It is trivial to achieve parallel
execution of simple, "embarrassingly parallel" data analysis
tasks. Complex tasks composed of multiple interrelated
data transformations are explicitly encoded as data flow
sequences, making them easy to write, understand, and
maintain.
• Optimization opportunities. The way in which tasks are
encoded permits the system to optimize their execution
automatically, allowing the user to focus on semantics
rather than efficiency.
• Extensibility. Users can create their own functions to do
special-purpose processing (a brief sketch follows this list).
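As a sketch of extensibility (the jar name and the Java UDF class below are hypothetical), a user-defined function can be registered and then called from Pig Latin:

REGISTER 'myudfs.jar';                      -- hypothetical jar containing the UDF
DEFINE ToUpper com.example.pig.ToUpper();   -- hypothetical Java UDF class
names   = LOAD 'input/names.txt' AS (name:chararray);
shouted = FOREACH names GENERATE ToUpper(name);
DUMP shouted;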
The Apache Pig framework has the following major components as part of its
architecture:
• Parser
• Optimizer
• Compiler
• Execution Engine
ARCHITECTURE
• 1. Parser: Any Pig scripts or commands entered in the Grunt shell are
first handled by the parser.
The parser checks the script's syntax, performs type checking, and
carries out various other validations.
The output of this stage is a Directed Acyclic Graph (DAG) that
represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are the nodes, and the
data flows between them are the edges.
• 2. Optimizer: As soon as parsing is completed and the DAG is
generated, it is passed to the logical optimizer, which performs
logical optimizations such as projection and pushdown.
• Projection and pushdown improve query performance by omitting
unnecessary columns or data and pruning the loader so that it loads
only the required columns (see the sketch after this item).
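In the hedged sketch below (paths and field names are hypothetical), only two of the four loaded columns are referenced later, so the optimizer can push the projection down into the loader and avoid materializing the unused columns:

logs  = LOAD 'input/access_log.txt'
        AS (user:chararray, url:chararray, referrer:chararray, bytes:int);
slim  = FOREACH logs GENERATE user, bytes;   -- projection: only two columns survive
heavy = FILTER slim BY bytes > 1048576;
STORE heavy INTO 'output/heavy_users';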
• 3. Compiler: The optimized logical plan is then compiled by the
compiler, which generates a series of MapReduce jobs.
• The compiler converts the Pig job into MapReduce jobs automatically
and exploits optimization opportunities in the script, so the
programmer does not have to tune the program manually.
• Because Pig is a data-flow language, its compiler can reorder the
execution sequence to improve performance, as long as the reordered
plan produces the same result as the original program (the EXPLAIN
sketch below shows the plans the compiler produces).
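As a sketch, the plans the compiler produces can be inspected from the Grunt shell with the EXPLAIN operator (the relation names are hypothetical):

grunt> heavy = FILTER logs BY bytes > 1048576;
grunt> EXPLAIN heavy;   -- prints the logical, physical, and MapReduce plans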
• 4. Execution Engine: Finally, all the MapReduce jobs generated by
the compiler are submitted to Hadoop in sorted order. The MapReduce
jobs are then executed on Hadoop to produce the desired output.
• 5. Execution Modes: Pig works in two execution modes, depending on
where the script runs and where the data resides (an invocation
sketch follows the mode descriptions below):
• Local Mode: Local mode is best suited for small data sets.
• Pig runs in a single JVM, and all files are installed and run on the
local host, so parallel mapper execution is not possible.
• While loading data, Pig always looks in the local file system.
• MapReduce Mode (MR Mode): In MapReduce mode, the programmer needs
access to a Hadoop cluster with an HDFS installation.
• In this mode, the data being processed resides in HDFS.
• When a Pig script is executed in MR mode, the Pig Latin statements
are converted into MapReduce jobs in the back end to perform the
operations on the data. MapReduce mode is the default, so it does not
need to be specified with the -x flag.
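As a sketch (the script name is hypothetical), the execution mode is chosen with the -x flag when Pig is started:

$ pig -x local myscript.pig        # local mode: single JVM, local file system
$ pig -x mapreduce myscript.pig    # MapReduce mode: runs on the Hadoop cluster
$ pig myscript.pig                 # no flag: defaults to MapReduce mode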
GRUNT
• Grunt is a Pig interactive shell.
• After invoking the Grunt shell, you can run your Pig
scripts in the shell.
• Commands: HDFS commands in the Pig Grunt shell
• 1. fs -ls /
• 2. fs -cat <path>
• 3. fs -mkdir <path>
• 4. fs -copyFromLocal <local path> <HDFS path>
Shell commands in the Pig Grunt shell
• Any shell command can be invoked with sh, and HDFS commands with fs
(a short session sketch follows this list).
• sh ls
• sh cat <file>
• clear
• help
• history
• set - assigns a value to a key, for example:
• > set job.name 'myjob'
• exec - runs a Pig script from the shell
• kill - kills a MapReduce job by its job id
• run - runs a Pig script in the current Grunt context
• quit
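A short hedged example of a Grunt session using a few of these commands (the paths and the script name are hypothetical):

grunt> fs -mkdir /user/demo/input
grunt> fs -copyFromLocal data.txt /user/demo/input
grunt> fs -ls /user/demo/input
grunt> sh ls
grunt> set job.name 'myjob'
grunt> exec wordcount.pig
grunt> quit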
PIG LATIN
• Pig is a high-level platform or tool used to process large datasets.
• It provides a high level of abstraction for processing on top of
MapReduce.
• It provides a high-level scripting language, known as Pig Latin, which
is used to develop data analysis code.
• To process data stored in HDFS, programmers write scripts using the
Pig Latin language.
• Internally, the Pig Engine (a component of Apache Pig) converts all of
these scripts into specific map and reduce tasks.
• These tasks are not visible to the programmer, which preserves the
high level of abstraction.
• Pig Latin and Pig Engine are the two main components of the Apache
Pig tool. The output of Pig is stored in HDFS.
Pig Latin statements
Pig Latin statements are generally organized as follows:
• A LOAD statement to read data from the file system.
• A series of "transformation" statements to process the
data.
• A DUMP statement to view results or a STORE
statement to save the results.
Note that a DUMP or STORE statement is required to
generate output.
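A minimal sketch of this organization (the input path, schema, and output path are hypothetical):

-- LOAD: read data from the file system
sales   = LOAD 'input/sales.csv' USING PigStorage(',')
          AS (store:chararray, item:chararray, amount:double);
-- transformation statements: group and aggregate
byStore = GROUP sales BY store;
totals  = FOREACH byStore GENERATE group AS store, SUM(sales.amount) AS total;
-- STORE (or DUMP) is required to actually generate output
STORE totals INTO 'output/store_totals';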
Pig Latin operators
• Arithmetic operators
• Relational operators
- LOAD, STORE, FILTER, DISTINCT, JOIN, GROUP, ORDER, LIMIT, etc.
• Comparison operators
• Type construction operators
- ( ) tuple constructor, [ ] map constructor, { } bag constructor
• Diagnostic operators
1. DUMP - to print the contents of a relation
2. DESCRIBE - to verify the schema of a relation
3. EXPLAIN - to view the logical, physical, and MapReduce plans of a relation
4. ILLUSTRATE - to review how the data is transformed step by step
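A brief hedged sketch of the type constructors and two of the diagnostic operators, continuing the hypothetical sales example above:

keyed = FOREACH sales GENERATE (store, item) AS pair,     -- ( ) builds a tuple
                               ['name'#store] AS tags,    -- [ ] builds a map
                               {(item)} AS items;         -- { } builds a bag
DESCRIBE keyed;      -- verify the schema of the relation
ILLUSTRATE keyed;    -- review how sample data is transformed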
HIVE
• Hive is data warehouse software for providing data query and analysis.
• Developed by Facebook and built on top of Apache Hadoop.
• Provides support for reading, writing, and managing large datasets
stored in Hadoop HDFS.
• Queries are expressed in HiveQL, Hive's SQL-like query language.
There are three core parts of the Hive architecture:
• Hive Client
• Hive Services
• Hive Storage and Computing
Hive architecture:
• Hive Client
• Hive provides multiple drivers for communication with different types
of applications. Hive supports applications written in programming
languages such as Python, C++, Java, etc.
• Hive clients are categorized into three types:
• Hive Thrift Clients
• Hive JDBC Driver
• Hive ODBC Driver
