This document outlines the course content for a Big Data Analytics course. The course covers key concepts related to big data, including Hadoop, MapReduce, HDFS, YARN, Pig, Hive, NoSQL databases and analytics tools. The five units cover an introduction to big data and Hadoop; MapReduce and YARN; analyzing data with Pig; understanding Hive; and NoSQL data management. Experiments related to big data are also listed.
1. Course Instructor: Dr. C. Sreedhar
BIG DATA ANALYTICS
B.Tech VII Sem CSE A
*Note: Some images are downloaded and used from internet sources
2. Unit I
What is Big Data Analytics
Why this sudden hype around big data analytics
Classification of Analytics
Top Challenges facing big data
Few top analytics tools
Introduction to Hadoop
HDFS, HDFS Commands
Processing Data with Hadoop
Managing Resources and Applications with Hadoop YARN
Interacting with Hadoop Ecosystem
3. Unit II
Understanding MapReduce & YARN:
The MapReduce Framework Concept
Developing Simple MapReduce Application
Points to consider while designing MapReduce
YARN background
YARN architecture
Working of YARN
4. Unit III
Analyzing Data with Pig
Introducing Pig
Running Pig
Getting started with Pig Latin
Working with operators in Pig
Debugging Pig
5. Unit IV
Understanding HIVE:
Introducing Hive
Hive services
Built-in functions in Hive
Hive DDL
Data manipulation in Hive
6. Unit V
NoSQL Data Management:
Introduction to NoSQL
Characteristics of NoSQL
Types of NoSQL data models
Schemaless databases
9. Big Data: common misconceptions
Big Data is NOT:
Expensive
Machine Data
Quality Data
Always right
100% accurate
A Self-Learning Algorithm
A Solution for every Business
Meant only for Data Scientists
Magic that changes overnight
10. Traditional method of file management
[Diagram: separate Patients, Doctors, Wards and Rooms files, each accessed through its own application program by its own users]
11. What Big Data is
Big Data is about the extraction of actionable or
useful information from very large datasets.
Big data is a field that treats ways to analyze,
systematically extract information from, or otherwise
deal with data sets that are too large or complex to
be dealt with by traditional data-processing
application software.
12. Big Data
The importance of big data does not revolve around
how much data the organization/company has, but
what can be done with such massive volumes of data.
Big Data helps in:
cost reductions,
time reductions,
new product development and optimized offerings, and
smart decision making.
13. Big Data in the real world
By 2020, there will be around 40 trillion gigabytes of data
(40 zettabytes).
90% of all data has been created in the last few years.
Today it would take a person approximately 181 million years
to download all the data from the internet.
In 2012, only 0.5% of all data was analyzed.
In 2018, internet users spent 2.8 million years online.
Social media accounts for 33% of the total time spent online.
14. Big Data in the real world
97.2% of organizations are investing in big data and AI.
Using big data, Netflix saves $1 billion per year on
customer retention.
Job listings for data science and analytics will reach
around 2.7 million by 2020.
Automated analytics will be vital to big data by 2022.
The big data analytics market is set to reach $103 billion
by 2023.
15. What Big Data is
Big datasets are too large and complex to be
processed by traditional methods.
Considering that in a single minute, there are approximately:
300,000 Instagram posts
500,000 tweets sent
4,500,000 YouTube videos watched
4,500,000 Google searches
200 million emails sent
16. Big Data
How do organizations optimize the values of big data?
Set a big data strategy
Identify big data sources
Access, manage and store big data
Analyze big data
Make data-driven decisions
17. Big Data: Definition
Gartner:
Big data is high-volume, high-velocity and/or
high-variety information assets that demand cost-
effective, innovative forms of information
processing that enable enhanced insight, decision
making, and process automation.
19. What is Big Data Analytics
Analytics in general, involves the use of mathematical
or scientific methods to generate insight from data
Big data analytics is the use of advanced analytic
techniques against very large, diverse data sets that
include structured, semi-structured and unstructured
data, from different sources, and in different sizes
from terabytes to zettabytes.
20. What is Big Data Analytics
Technology-enabled analytics:
Quite a few data analytics and visualization tools are available
in the market today from leading vendors such as IBM, Tableau,
SAS, R Analytics, Statistica, World Programming Systems (WPS),
etc. to help process and analyze big data.
About gaining a meaningful, deeper, and richer insight into
the business to steer it in the right direction, and understanding
the customers' demographics to cross-sell and up-sell to them.
21. What is Big Data Analytics
Handshake between three communities: IT, business users,
and data scientists.
Working with datasets whose volume and variety exceed
current storage, processing capabilities and infrastructure.
About moving code to data. This makes perfect sense as
the program for distributed processing is tiny (just a few
KBs) compared to the data (Terabytes or Petabytes today
and likely to be Exabytes or Zettabytes in near future).
22. Why this sudden hype around BDA?
Following are some of the reasons for sudden hype about BDA:
Data is growing at a 40% compound annual rate, reaching nearly 45
ZB by 2020.
Volume of business data worldwide is expected to double every 1.2 years.
500 million "tweets" are posted by Twitter users every day.
2.7 billion "Likes" and comments are posted by Facebook users in a day.
90% of the world's data was created in the past 2 years.
Cost per gigabyte of storage has hugely dropped.
There are an overwhelming number of user-friendly analytics tools
available in the market today.
23. Classification of Analytics
There are basically two schools of thought:
Those that classify analytics into basic,
operationalized, advanced, and monetized.
Those that classify analytics into analytics 1.0,
analytics 2.0, analytics 3.0 and analytics 4.0.
24. First School of thought
Basic analytics:
used to explore your data in a graphical manner where the data provides some
value through simple visualizations
Operationalized analytics:
Operationalized analytics includes several concepts like data discovery, decision
management, information delivery
Advanced analytics:
Provide analytical algorithms for executing complex analysis of either structured
or unstructured data
Monetized analytics:
This is analytics in use to derive direct business revenue.
25. Second School of Analytics
Analytics 1.0
Data sources relatively small and
structured, from internal systems
Majority of analytical activity
was descriptive analytics, or
reporting
Creating analytical models was a
time-consuming batch process
Decisions were made based on
experience and intuition
Analytics 2.0
Data sources are big,
complex, unstructured, fast
moving data
Rise of Data Scientists
Rise of Hadoop & open
source
Visual Analytics
26. Second School of Analytics
Analytics 3.0
Mix of all data
Internal/external
products/decisions
Analytics a core capability
Move at speed & scale
Predictive & prescriptive
analytics
Analytics 4.0
Analytics embedded,
invisible and automated
Cognitive technologies
Robotic process automation
for digital tasks
Augmentation and not
automation
30. Apache Hadoop
is a collection of open-source software utilities
that facilitate using a network of many
computers to solve problems involving massive
amounts of data and computation.
It provides a software framework for distributed
storage and processing of big data using the
MapReduce programming model
31. Apache Spark
is an open-source distributed general-purpose
cluster-computing framework.
is a unified analytics engine for large-scale data
processing.
provides an interface for programming entire
clusters with implicit data parallelism and fault
tolerance
32. Apache Storm
A system for processing streaming data in real time
adds reliable real-time data processing capabilities
to Enterprise Hadoop
Is distributed, resilient and real-time
33. Cassandra
is a free and open-source, distributed, wide column
store, NoSQL database management system
designed to handle large amounts of data across
many commodity servers, providing high availability
with no single point of failure
is the right choice when you need scalability and
high availability without compromising performance
34. Tableau
Tableau empowers business users to quickly and
easily find valuable insights in their vast Hadoop
datasets.
Tableau removes the need for users to have
advanced knowledge of query languages by
providing a clean visual analysis interface
35. Lumify
LUMIFY is a powerful big data fusion, analysis, and
visualization platform that supports the
development of actionable intelligence.
Lumify is possibly the choice for those poring over
the 11 million-plus documents
36. Windows Azure
is a cloud computing service created by Microsoft
for building, testing, deploying, and managing
applications and services through Microsoft-
managed data centers
37. Splunk
Splunk is a software platform to search, analyze
and visualize the machine-generated data
gathered from the websites, applications, sensors,
devices etc. which make up IT infrastructure and
business.
38. Talend
Talend is an open source software integration
platform that helps you effortlessly turn data
into business insights.
provides various software and services for data
integration, data management, enterprise
application integration, data quality, cloud storage
and Big Data.
39. HBase
is an open-source non-relational distributed
database modeled after Google's Bigtable and
written in Java.
It is developed as part of Apache Software
Foundation's Apache Hadoop project and runs on
top of HDFS, providing Bigtable-like capabilities for
Hadoop
40. Hive
is a data warehouse software project built on top of
Apache Hadoop for providing data query and
analysis.
Hive gives an SQL-like interface to query data
stored in various databases and file systems that
integrate with Hadoop
41. Apache Pig
is a high-level platform for creating programs that
run on Apache Hadoop.
It is a tool/platform used to analyze large
sets of data, representing them as data flows.
All the data manipulation operations in
Hadoop can be performed using Apache Pig.
42. Introduction to Hadoop
Hadoop is an open source framework that allows users to
store (HDFS) and process (MapReduce) large data sets
in a distributed and parallel manner.
43. Traditional DB vs. Hadoop
Traditional Database System: Data is stored in a central location and sent to the processor at runtime.
Hadoop: The program goes to the data. Hadoop initially distributes the data to multiple systems and later runs the computation wherever the data is located (distributed computation).
Traditional Database System: Cannot be used to process and store a significant amount of data (big data).
Hadoop: Works better when the data size is big; it can process and store a large amount of data efficiently and effectively.
Traditional Database System: A traditional RDBMS is used to manage only structured and semi-structured data; it cannot be used to control unstructured data.
Hadoop: Can process and store a variety of data, whether structured or unstructured.
44. History of Hadoop
1997: Doug Cutting developed Lucene, an open source, Java-based indexing and search software.
2001: Mike Cafarella focused on indexing the entire web.
Problems:
Schemaless (no tables and columns)
Durable (once written, data should never be lost)
Capable of handling component failure (CPU, memory, network)
Automatically re-balanced (disk space consumption)
2003: Google published the GFS paper; Nutch DFS was developed.
The problem of durability and fault tolerance was still not solved.
45. History of Hadoop…
Solution: Distributed processing; the file system was divided into 64 MB chunks, storing each
chunk on 3 different nodes (replication factor).
2004: Google published a paper: MapReduce – Simplified Data Processing on Large
Clusters.
Solved the problems of parallelization, distribution and fault tolerance.
2005: Cutting integrated MapReduce into Nutch.
2006: Cutting named it Hadoop, which included HDFS and MR.
2008: Cutting licensed it under the Apache Software Foundation.
Certain problems/enhancements in Hadoop created sub-projects like Hive, Pig,
HBase, ZooKeeper.
46. Hadoop
Hadoop runs code across a cluster of computers. This process
includes the following core tasks that Hadoop performs −
Data is divided into directories and files (uniform-sized blocks of 128 MB)
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checks for successful execution of the code
Performs the sort phase that takes place between the map and reduce stages.
Sends the sorted data to a certain computer.
Writing the debugging logs for each job.
47. Hadoop as a good choice for:
Indexing log files
Sorting vast amounts of data
Image analysis
Search engine optimization
Analytics
Hadoop as a poor choice for:
Calculating the value of pi to 1,000,000 digits
Calculating Fibonacci sequences
Small structured data
49. HDFS
HDFS is a distributed file system that is fault tolerant,
scalable and extremely easy to expand.
HDFS is the primary distributed storage for Hadoop
applications.
HDFS provides interfaces for applications to move
themselves closer to data.
HDFS is designed to process large data sets with
write-once-read-many semantics
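To make the write-once-read-many semantics concrete, below is a minimal sketch using Hadoop's Java FileSystem API. The path /tmp/demo.txt, the file content and the class name are hypothetical, and the sketch assumes an HDFS reachable through the core-site.xml on the classpath.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteOnceReadMany {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/tmp/demo.txt");      // hypothetical path

    // Write once: create the file and write its content.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Read many: any number of clients can now open and read the same file.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
    fs.close();
  }
}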
51. Namenode
It is the master daemon that maintains and manages the DataNodes (slave nodes)
It records metadata of all blocks stored in the cluster, e.g. location of blocks
stored, size of the files, permissions, hierarchy, etc.
It records each and every change that takes place to file system metadata
If a file is deleted in HDFS, NameNode will immediately record this in EditLog
It regularly receives a Heartbeat and a block report from all the DataNodes
in the cluster to ensure that the DataNodes are alive
It keeps record of all blocks in HDFS and DataNode in which they are stored
52. DataNode
It is the slave daemon which runs on each slave machine
The actual data is stored on DataNodes
It is responsible for serving read and write requests from the clients
It is also responsible for creating blocks, deleting blocks and
replicating the same based on the decisions taken by the
NameNode
It sends heartbeats to the NameNode periodically to report the
overall health of HDFS, by default, this frequency is set to 3 seconds
53. Secondary Namenode
Works as a helper node to primary NameNode
Downloads the Fsimage file and edit logs file from
NameNode
Reads from RAM of NameNode and stores it to hard
disks periodically.
If the NameNode fails, the last saved FsImage on the secondary
NameNode can be used to recover the file system
metadata
55. List: Files
hdfs dfs -ls / – Lists all the files/directories for the given HDFS destination path.
hdfs dfs -ls -d /inputnew2 – Directories are listed as plain files. In this case, the command lists the details of the inputnew2 folder.
hdfs dfs -ls -R /user – Recursively lists all files in the /user directory and all its subdirectories.
hdfs dfs -ls /inputnew* – Lists all the files matching the pattern. In this case, it lists all files in the present directory whose names start with 'inputnew'.
56. Read / Write Files
hdfs dfs -text /inputnew/inputFile.txt – Takes a source file and outputs the file in text format on the terminal.
hdfs dfs -cat /inputnew/inputFile.txt – Displays the content of the HDFS file inputFile.txt on your stdout.
hdfs dfs -appendToFile /home/ubuntu/test1 /hadoop/text2 – Appends the content of the local file test1 to the HDFS file text2.
57. Upload / Download Files
hdfs dfs -put /home/ubuntu/sample /hadoop – Copies the file from the local file system to HDFS.
hdfs dfs -put -f /home/ubuntu/sample /hadoop – Copies the file from the local file system to HDFS; if the file already exists in the given destination path, the -f option makes put overwrite it.
hdfs dfs -get /newfile /home/ubuntu/ – Copies the file from HDFS to the local file system.
hdfs dfs -copyFromLocal /home/ubuntu/sample /hadoop – Works similarly to the put command, except that the source is restricted to a local file reference.
hdfs dfs -copyToLocal /newfile /home/ubuntu/ – Works similarly to the get command, except that the destination is restricted to a local file reference.
58. File management
hdfs dfs -cp /hadoop/file1 /hadoop1 – Copies a file from source to destination on HDFS. In this case, copies file1 from the hadoop directory to the hadoop1 directory.
hdfs dfs -cp -f /hadoop/file1 /hadoop1 – Copies a file from source to destination on HDFS. Passing -f overwrites the destination if it already exists.
hdfs dfs -mv /hadoop/file1 /hadoop1 – Moves files that match the specified file pattern <src> to a destination <dst>. When moving multiple files, the destination must be a directory.
hdfs dfs -rm /hadoop/file1 – Deletes the file (sends it to the trash).
hdfs dfs -rmdir /hadoop1 – Deletes a directory.
hdfs dfs -mkdir /hadoop2 – Creates a directory in the specified HDFS location.
59. Permissions
hdfs dfs -touchz /hadoop3 – Creates a file of zero length at <path>, with the current time as the timestamp of that <path>.
hdfs dfs -chmod 755 /hadoop/file1 – Changes the permissions of the file.
hdfs dfs -chown ubuntu:ubuntu /hadoop – Changes the owner of the file. The first ubuntu in the command is the owner and the second one is the group.
60. File System
hdfs dfs -df /hadoop – Shows the capacity, free and used space of the filesystem.
hdfs dfs -df -h /hadoop – Shows the capacity, free and used space of the filesystem; the -h parameter formats the sizes in a human-readable fashion.
hdfs dfs -du /hadoop/file – Shows the amount of space, in bytes, used by the files that match the specified file pattern.
hdfs dfs -du -s /hadoop/file – Rather than showing the size of each individual file that matches the pattern, shows the total (summary) size.
61. Administration
hadoop version – Checks the version of Hadoop.
hdfs fsck / – Checks the health of the Hadoop file system.
hdfs dfsadmin -safemode leave – Turns off the safemode of the NameNode.
hdfs dfsadmin -refreshNodes – Re-reads the hosts and exclude files to update the set of DataNodes that are allowed to connect to the NameNode and those that should be decommissioned or recommissioned.
hdfs namenode -format – Formats the NameNode.
63. HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Data processing using programming
Spark: In-memory Data Processing
PIG, HIVE: Data Processing Services using Query (SQL-like)
HBase: NoSQL Database
Mahout, Spark MLlib: Machine Learning
Apache Drill: SQL on Hadoop
ZooKeeper: Managing Cluster
Oozie: Job Scheduling
Flume, Sqoop: Data Ingesting Services
Solr & Lucene: Searching & Indexing
Ambari: Provision, Monitor and Maintain cluster
64. MapReduce Framework
Need for parallel distribution of tasks
Automatic expansion and contraction of processes
Enables continuation of processes w/o being affected by
network failures or system failures
MapReduce: Map and Reduce
Map and Reduce do not modify the original data; instead, they create new data
structures to hold their output
65. Features of MapReduce
Scheduling
Synchronization
Data locality
Handling of errors/faults
Scale out architecture
66. MapReduce Framework
Very large scale data: peta, exa bytes
Write once and read many data: allows for parallelism without mutexes
Map and Reduce are the main operations: simple code
There are other supporting operations such as combine and partition.
All map tasks should be completed before the reduce operation starts.
Map and reduce operations are typically performed by the same physical processor.
Number of map tasks and reduce tasks are configurable.
Operations are provisioned near the data.
Commodity hardware and storage.
Special distributed file system. Example: Hadoop Distributed File System (HDFS).
67. Working of MapReduce
MapReduce programming model works on an algorithm to execute the map and
reduce operations. Algorithm steps as follows:
Take a large dataset or set of records
Perform iteration over the data
Extract interesting patterns to prepare an output list using the map function
Arrange the output list to enable optimization for further processing
Compute a set of results by using reduce function
Provide the final output
The MapReduce model executes a given task by dividing it into two functions: map and reduce.
The map function is executed first, in parallel on different machines. The reduce function
takes the output of the map function to present the final output in an aggregate form.
The classic worked example, word count, is sketched below.
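Below is a minimal sketch of the word-count job in Java, closely following the standard Hadoop WordCount example (the org.apache.hadoop.mapreduce API); input and output paths are taken from the command line, and the reducer doubles as a combiner for local pre-aggregation.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word after the shuffle/sort phase.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation (combine)
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}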
68. MapReduce framework
JobTracker receives the jobs from client applications to process
large information.
These jobs are assigned in the forms of individual tasks (after a
job is divided into smaller parts) to various TaskTrackers.
The task distribution is transmitted to the reduce function so that
the final, integrated output, which is an aggregate of the data
processed by the map function, can be provided.
A cluster uses commodity servers as nodes. The data
processing job is accomplished through MapReduce and HDFS
69. Input is provided from large data files in the form of key-value pair,
which is the standard input format in a Hadoop MapReduce
programming model.
The input data is divided into small pieces, and master and slave
nodes are created. The master node usually executes on the
machine where the data is present, and slaves are made to work
remotely on the data
The map operation is performed simultaneously on all the data
pieces, which are read by the map function. The map function
extracts the relevant data and generates a key-value pair for it.
70. MapReduce Framework
Client: This initializes the job
JobTracker: is the master daemon for both Job resource management and
scheduling/monitoring of jobs
TaskTracker: is a slave node daemon in the cluster that accepts tasks (Map, Reduce and
Shuffle operations) from a JobTracker
Map tasks deal with the splitting and mapping of data, while Reduce tasks
shuffle and reduce the data
71. Hadoop Modes
Local (standalone) mode:
•No daemons
•Executes all parts of Hadoop MapReduce within a single Java process and uses the local filesystem as the storage
•No DFS
•Useful for testing/debugging MapReduce applications locally
Pseudo-distributed mode:
•We can run Hadoop on a single machine, emulating a distributed cluster
•Runs the different services of Hadoop as different Java processes, but within a single machine
•Each Hadoop daemon runs as a separate Java process
Fully distributed mode:
•Supports clusters that span from a few nodes to thousands of nodes
•Full production runs
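As a hedged illustration of the pseudo-distributed setup, the two standard Hadoop configuration files are typically edited as sketched below; hdfs://localhost:9000 is a conventional example address, and a replication factor of 1 suits a single machine.

core-site.xml:
<configuration>
  <property>
    <!-- point the default filesystem at the local HDFS daemon -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <!-- one machine, so keep a single replica per block -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>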
72. Hadoop MapReduce: A Closer Look
[Diagram: on each node, files loaded from the local HDFS store pass through an InputFormat, which divides them into Splits; RecordReaders (RR) turn each split into input (K, V) pairs for the Map tasks; a Partitioner assigns the intermediate (K, V) pairs to reducers; during the Shuffling process the intermediate (K, V) pairs are exchanged by all nodes; after a Sort, the Reduce tasks produce the final (K, V) pairs, which an OutputFormat writes back to the local HDFS store.]
74. Resource Manager (RM)
Responsibilities: Resource management and
assignment of all the apps
Master daemon of YARN
Requests received by the RM are forwarded to the
corresponding node manager.
RM does the allocation of available resources.
The RM is the highest authority for the allocation of resources.
75. Node Manager (NM)
More generic, flexible and efficient than the TaskTracker.
Dynamically created resource containers.
Container refers to a collection of resources such as
memory, CPU, disk and network IO.
NM is the slave daemon of Yarn.
NM is the per-machine/per-node framework agent,
responsible for containers, monitoring their resource usage
and reporting the same to the ResourceManager.
76. Container
Container in YARN is where a unit of work happens in the
form of task.
A job/application is split in tasks and each task gets
executed in one container having a specific amount of
allocated resources.
A container can be understood as a logical reservation of
resources (memory and vCores) that will be utilized by the task
running in that container
77. Application Master
Responsible for managing a set of submitted tasks or applications.
It first verifies and validates the submitted application's specifications and
rejects the application if there are not enough resources available.
It also ensures no other application exists with the same ID which is
already submitted.
Finally, it also observes the states of applications and manages
finished applications to save some Resource Manager’s memory.
78. YARN: Yet Another Resource Negotiator
YARN: Resource management + Job scheduling
Type of processing – MapReduce: batch processing with a single engine; YARN: real-time, batch, and interactive processing with multiple engines
Cluster resource optimization – MapReduce: average, due to fixed Map and Reduce slots; YARN: excellent, due to central resource management
Suitable for – MapReduce: only MapReduce applications; YARN: MapReduce and non-MapReduce applications
Managing cluster resources – MapReduce: done by the JobTracker; YARN: done by YARN
Namespace – MapReduce: supports only one namespace, i.e., HDFS; YARN: Hadoop supports multiple namespaces
79. Working of YARN
1. A client program submits the application.
2. The ResourceManager allocates a specified container to start the ApplicationMaster (AM).
3. The ApplicationMaster, on boot-up, registers with the RM.
4. The AM negotiates with the RM for appropriate resource containers.
5. On successful container allocations, the AM contacts the NM to launch the containers.
6. The application code is executed within the container, and the execution status is reported back to the AM.
7. During execution, the client communicates directly with the AM or RM to get status, progress updates, etc.
8. Once the application is complete, the AM unregisters with the RM and shuts down, allowing its own container to be reclaimed.
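As a small illustration of step 7, the sketch below uses YARN's Java client API to ask the ResourceManager for the status of known applications. It assumes a yarn-site.xml on the classpath; the class name is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration(); // reads yarn-site.xml
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Ask the ResourceManager for all known applications and their states.
    for (ApplicationReport report : yarnClient.getApplications()) {
      System.out.printf("%s  %s  %s%n",
          report.getApplicationId(),
          report.getName(),
          report.getYarnApplicationState());
    }
    yarnClient.stop();
  }
}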
80. Introduction: Apache Pig
Pig is a high level scripting language for operating on large datasets inside
Hadoop.
Compiles scripting language into MapReduce operations.
provides a simple language called Pig Latin, for queries and data
manipulation
Pig's multi-query approach reduces the number of times data is scanned.
Pig was developed as an ad-hoc way of creating and executing MapReduce
jobs on very large data sets.
Pig provides data operations like filters, joins, ordering, etc. and nested data
types like tuples and maps, that are missing from MapReduce.
81. When to use and when not to use Pig
Use Pig:
When data loads are time sensitive.
When processing various data sources.
When analytical insights are required through sampling.
Do not use Pig:
In places where the data is completely unstructured, like video, audio and readable text.
In places where time constraints exist, as Pig is slower than MapReduce jobs.
In places where more power is required to optimize the codes.
82. Applications of Pig
Processing of web logs.
Data processing for search platforms.
Supports ad-hoc queries across large data-sets.
Pig scripting is used for exploring large datasets.
In the prototyping of processing algorithms for large data-sets.
Required to process time-sensitive data loads.
Collecting large amounts of data in the form of search logs and web crawls.
Used where analytical insights are needed through sampling.
83. Apache Pig vs MapReduce
Apache Pig: It is a scripting language. MapReduce: It is a compiled programming language.
Apache Pig: Abstraction is at a higher level. MapReduce: Abstraction is at a lower level.
Apache Pig: It has fewer lines of code compared to MapReduce. MapReduce: Lines of code are more.
Apache Pig: Less development effort is needed. MapReduce: More development effort is required.
Apache Pig: Code efficiency is less compared to MapReduce. MapReduce: Code efficiency is higher compared to Pig.
84. Features of Pig
Rich set of operators
It provides many operators to perform operations like join, sort, filter, etc.
Ease of programming:
Pig Latin is similar to SQL and it is easy to write a Pig script if you are good at SQL.
Optimization opportunities:
The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus
only on semantics of the language.
Extensibility:
Using the existing operators, users can develop their own functions to read, process, and write data.
UDFs (a sketch follows this list):
Pig provides the facility to create User-Defined Functions in other programming languages
such as Java, and to invoke or embed them in Pig scripts.
Handles all kinds of data:
Apache Pig analyzes all kinds of data, both structured and unstructured.
It stores the results in HDFS.
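As a hedged sketch of the UDF facility mentioned above, the hypothetical Java class below extends Pig's EvalFunc to upper-case a single chararray argument. After packaging it into a jar, it would be registered with REGISTER and called from FOREACH ... GENERATE.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: upper-cases its single chararray argument.
public class UpperCase extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    // Guard against empty input and null fields.
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return input.get(0).toString().toUpperCase();
  }
}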
85. Apache Pig: Introduction
The language used to analyze data in Hadoop using Pig is known
as Pig Latin.
To perform a particular task, programmers using Pig
need to write a Pig script using the Pig Latin language, and execute
it using any of the execution mechanisms (Grunt shell, UDFs,
embedded).
Internally, Apache Pig converts these scripts into a series of
MapReduce jobs, and thus, it makes the programmer’s job easy.
86. Parameter-wise comparison: MapReduce vs Pig
Paradigm – MapReduce: data processing paradigm; Pig: procedural dataflow language
Type of language – MapReduce: low level and rigid; Pig: a high-level language
Join operation – MapReduce: it is difficult to perform join operations between datasets; Pig: performing a join operation is simple
Skills needed – MapReduce: the developer needs to have a robust knowledge of Java; Pig: a good knowledge of SQL is needed
Code length – MapReduce: requires about 20 times more code to accomplish the same task; Pig: due to the multi-query approach, the code length is greatly reduced
Compilation – MapReduce: jobs have a prolonged compilation process; Pig: no need for compilation, as every Pig operator is converted internally into MapReduce jobs
Nested data types – MapReduce: not present; Pig: present
88. Running Pig
Pig Latin statements and Pig commands can be run using interactive
mode and batch mode.
Pig Latin commands can be run in three ways: via the Grunt interactive shell,
through a script file, and as embedded queries inside Java
programs (a sketch of the embedded approach follows this slide).
Pig has six execution modes
Local mode, Tez Local mode, Spark local mode, Mapreduce mode, Tez mode,
Spark mode
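A minimal sketch of the embedded approach, reusing the 'student' example from the later slides; it assumes Pig's Java libraries on the classpath and runs in local mode (ExecType.MAPREDUCE would target a cluster). The class name is hypothetical.

import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPig {
  public static void main(String[] args) throws Exception {
    // Local mode; pass ExecType.MAPREDUCE to run against a Hadoop cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Register the same Pig Latin statements a script file would contain.
    pig.registerQuery("A = LOAD 'student' USING PigStorage() "
        + "AS (name:chararray, age:int, gpa:float);");
    pig.registerQuery("B = FOREACH A GENERATE name;");

    // openIterator triggers execution and streams back the tuples of B.
    Iterator<Tuple> it = pig.openIterator("B");
    while (it.hasNext()) {
      System.out.println(it.next());
    }
    pig.shutdown();
  }
}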
89. Local Mode – This mode can be run on a single machine; all files are installed and
run using the local host and file system. Uses the -x flag (pig -x local).
Tez Local Mode – This mode is similar to local mode, except internally Pig will
invoke the Tez runtime engine. Uses the -x flag (pig -x tez_local).
Spark Local Mode – This mode is similar to local mode, except internally Pig will
invoke the Spark runtime engine. Uses the -x flag (pig -x spark_local).
Mapreduce Mode – To run Pig in mapreduce mode, you need access to a Hadoop cluster
and an HDFS installation. Mapreduce mode is the default mode. Uses the -x flag (pig, or
pig -x mapreduce).
Tez Mode – To run Pig in Tez mode, you need access to a Hadoop cluster and an HDFS
installation. Specify Tez mode using the -x flag (-x tez).
Spark Mode – To run Pig in Spark mode, you need access to a Spark, YARN or Mesos
cluster and an HDFS installation. Specify Spark mode using the -x flag (-x spark).
90. Interactive mode
Local Mode : $ pig -x local
Tez Local Mode : $ pig -x tez_local
Spark Local Mode : $ pig -x spark_local
Mapreduce Mode : $ pig -x mapreduce
Tez Mode : $ pig -x tez
Spark Mode : $ pig -x spark
91. Batch mode
Local Mode : $ pig -x local id.pig
Tez Local Mode : $ pig -x tez_local id.pig
Spark Local Mode : $ pig -x spark_local id.pig
Mapreduce Mode : $ pig id.pig
Tez Mode : $ pig -x tez id.pig
Spark Mode : $ pig -x spark id.pig
92. Pig Latin statements
Pig Latin statements are basic constructs used to process data
using Pig.
A Pig Latin statement is an operator that takes a relation as input
and produces another relation as output.
Pig Latin statements may include expressions and schemas.
Pig Latin statements can span multiple lines and must end with a
semi-colon ( ; ).
By default, Pig Latin statements are processed using multi-query
execution.
93. Pig Latin statements are generally organized as follows:
A LOAD statement to read data from the file system.
A series of "transformation" statements to process the data.
A DUMP statement to view results or a STORE statement to save the results.
94. In the following example, Pig will validate, but not execute, the LOAD and FOREACH
statements.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
In the following example, Pig will validate and then execute the LOAD, FOREACH, and
DUMP statements.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(Ram)
(Sam)
(Ham)
(Pam)