This document outlines the course content for a Big Data Analytics course. The course covers key concepts related to big data, including Hadoop, MapReduce, HDFS, YARN, Pig, Hive, NoSQL databases and analytics tools. The five units cover an introduction to big data and Hadoop; MapReduce and YARN; analyzing data with Pig; understanding Hive; and NoSQL data management. Experiments related to big data are also listed.
1. Course Instructor: Dr. C. Sreedhar
BIG DATA ANALYTICS
B.Tech VII Sem CSE A
*Note: Some images are downloaded and used from internet sources
2. Unit I
What is Big Data Analytics
Why this sudden hype around big data analytics
Classification of Analytics
Top Challenges facing big data
Few top analytics tools
Introduction to Hadoop
HDFS, HDFS Commands
Processing Data with Hadoop
Managing Resources and Applications with Hadoop YARN
Interacting with Hadoop Ecosystem
3. Unit II
Understanding MapReduce & YARN:
The MapReduce Framework Concept
Developing Simple MapReduce Application
Points to consider while designing MapReduce
YARN background
YARN architecture
Working of YARN
4. Unit III
Analyzing Data with Pig
Introducing Pig
Running Pig
Getting started with Pig Latin
Working with operators in Pig
Debugging Pig
5. Unit IV
Understanding HIVE:
Introducing Hive
Hive services
Built-in functions in Hive
Hive DDL
Data manipulation in Hive
6. Unit V
NoSQL Data Management:
Introduction to NoSQL
Characteristics of NoSQL
Types of NoSQL data models
Schemaless databases
9. Big Data: common misconceptions
Big Data is NOT:
Expensive
Machine Data
Quality Data
Always right
100% accurate
A Self-Learning Algorithm
A Solution for every Business
Meant only for Data Scientists
Magic that changes overnight
10. Traditional method of file management
[Diagram: separate Patients, Doctors, Wards and Rooms files, each accessed through its own application program by its own users]
11. What Big Data is
Big Data is about the extraction of actionable or
useful information from very large datasets.
Big data is a field that treats ways to analyze,
systematically extract information from, or otherwise
deal with data sets that are too large or complex to
be dealt with by traditional data-processing
application software.
12. Big Data
The importance of big data does not revolve around
how much data the organization/company has, but
what can be done with such massive volumes of data.
Big Data helps in:
cost reductions,
time reductions,
new product development and optimized offerings, and
smart decision making.
13. Big Data in the real world
By 2020, there will be around 40 trillion gigabytes of data
(40 zettabytes).
90% of all data has been created in the last few years.
Today it would take a person approximately 181 million years
to download all the data from the internet.
In 2012, only 0.5% of all data was analyzed.
In 2018, internet users spent 2.8 million years online.
Social media accounts for 33% of the total time spent online.
14. Big Data in the real world
97.2% of organizations are investing in big data and AI.
Using big data, Netflix saves $1 billion per year on
customer retention.
Job listings for data science and analytics will reach
around 2.7 million by 2020.
Automated analytics will be vital to big data by 2022.
The big data analytics market is set to reach $103 billion
by 2023.
15. What Big Data is
Big datasets are too large and complex to be
processed by traditional methods.
Considering that in a single minute, there are approximately:
300,000 Instagram posts
500,000 tweets sent
4,500,000 YouTube videos watched
4,500,000 Google searches
200 million emails sent
16. Big Data
How do organizations optimize the values of big data?
Set a big data strategy
Identify big data sources
Access, manage and store big data
Analyze big data
Make data-driven decisions
17. Big Data: Definition
Gartner:
Big data is high-volume, high-velocity and/or
high-variety information assets that demand cost-
effective, innovative forms of information
processing that enable enhanced insight, decision
making, and process automation.
19. What is Big Data Analytics
Analytics in general, involves the use of mathematical
or scientific methods to generate insight from data
Big data analytics is the use of advanced analytic
techniques against very large, diverse data sets that
include structured, semi-structured and unstructured
data, from different sources, and in different sizes
from terabytes to zettabytes.
20. What is Big Data Analytics
Technology-enabled analytics:
Quite a few data analytics and visualization tools are available
in the market today from leading vendors such as IBM, Tableau,
SAS, R Analytics, Statistica, World Programming Systems (WPS),
etc. to help process and analyze big data.
About gaining a meaningful, deeper, and richer insight into
the business to steer it in the right direction, and understanding
the customers' demographics to cross-sell and up-sell to them.
21. What is Big Data Analytics
Handshake between three communities: IT, business users,
and data scientists.
Working with datasets whose volume and variety exceed
current storage, processing capabilities and infrastructure.
About moving code to data. This makes perfect sense as
the program for distributed processing is tiny (just a few
KBs) compared to the data (Terabytes or Petabytes today
and likely to be Exabytes or Zettabytes in near future).
22. Why this sudden hype around BDA?
Following are some of the reasons for sudden hype about BDA:
Data is growing at a 40% compound annual rate, reaching nearly 45
ZB by 2020.
Volume of business data worldwide is expected to double every 1.2 years.
500 million "tweets" are posted by Twitter users every day.
2.7 billion "Likes" and comments are posted by Facebook users in a day.
90% of the world's data was created in the past 2 years.
Cost per gigabyte of storage has hugely dropped.
There are an overwhelming number of user-friendly analytics tools
available in the market today.
23. Classification of Analytics
There are basically two schools of thought:
Those that classify analytics into basic,
operationalized, advanced, and monetized.
Those that classify analytics into analytics 1.0,
analytics 2.0, analytics 3.0 and analytics 4.0.
24. First School of thought
Basic analytics:
used to explore your data in a graphical manner where the data provides some
value through simple visualizations
Operationalized analytics:
Operationalized analytics includes several concepts like data discovery, decision
management, information delivery
Advanced analytics:
Provide analytical algorithms for executing complex analysis of either structured
or unstructured data
Monetized analytics:
This is analytics in use to derive direct business revenue.
25. Second School of Analytics
Analytics 1.0
Data sources relatively small and
structured, from internal systems
Majority of analytical activity
was descriptive analytics, or
reporting
Creating analytical models was a
time-consuming batch process
Decisions were made based on
experience and intuition
Analytics 2.0
Data sources are big,
complex, unstructured, fast
moving data
Rise of Data Scientists
Rise of Hadoop & open
source
Visual Analytics
26. Second School of Analytics
Analytics 3.0
Mix of all data
Internal/external
products/decisions
Analytics a core capability
Move at speed & scale
Predictive & prescriptive
analytics
Analytics 4.0
Analytics embedded,
invisible and automated
Cognitive technologies
Robotic process automation
for digital tasks
Augmentation and not
automation
30. Apache Hadoop
is a collection of open-source software utilities
that facilitate using a network of many
computers to solve problems involving massive
amounts of data and computation.
It provides a software framework for distributed
storage and processing of big data using the
MapReduce programming model
31. Apache Spark
is an open-source distributed general-purpose
cluster-computing framework.
is a unified analytics engine for large-scale data
processing.
provides an interface for programming entire
clusters with implicit data parallelism and fault
tolerance
32. Apache Storm
A system for processing streaming data in real time
adds reliable real-time data processing capabilities
to Enterprise Hadoop
Is distributed, resilient and real-time
33. Cassandra
is a free and open-source, distributed, wide column
store, NoSQL database management system
designed to handle large amounts of data across
many commodity servers, providing high availability
with no single point of failure
is the right choice when you need scalability and
high availability without compromising performance
34. Tableau
Tableau empowers business users to quickly and
easily find valuable insights in their vast Hadoop
datasets.
Tableau removes the need for users to have
advanced knowledge of query languages by
providing a clean visual analysis interface
35. Lumify
LUMIFY is a powerful big data fusion, analysis, and
visualization platform that supports the
development of actionable intelligence.
Lumify is possibly the choice for those poring over
the 11 million-plus documents
36. Windows Azure
is a cloud computing service created by Microsoft
for building, testing, deploying, and managing
applications and services through Microsoft-
managed data centers
37. Splunk
Splunk is a software platform to search, analyze
and visualize the machine-generated data
gathered from the websites, applications, sensors,
devices etc. which make up IT infrastructure and
business.
38. Talend
Talend is an open source software integration
platform that helps you effortlessly turn data
into business insights.
provides various software and services for data
integration, data management, enterprise
application integration, data quality, cloud storage
and Big Data.
39. HBase
is an open-source non-relational distributed
database modeled after Google's Bigtable and
written in Java.
It is developed as part of Apache Software
Foundation's Apache Hadoop project and runs on
top of HDFS, providing Bigtable-like capabilities for
Hadoop
40. Hive
is a data warehouse software project built on top of
Apache Hadoop for providing data query and
analysis.
Hive gives an SQL-like interface to query data
stored in various databases and file systems that
integrate with Hadoop
41. Apache Pig
is a high-level platform for creating programs that
run on Apache Hadoop.
It is a tool/platform used to analyze large
sets of data, representing them as data flows.
All the data manipulation operations in
Hadoop can be performed using Apache Pig.
42. Introduction to Hadoop
Hadoop is an open source framework that allows users to
store (HDFS) and process (MapReduce) large data sets
in a distributed and parallel manner.
43. Traditional DB vs. Hadoop
Traditional Database System: Data is stored in a central location and sent to the processor at runtime.
Hadoop: The program goes to the data. Hadoop initially distributes the data to multiple systems and later runs the computation wherever the data is located (distributed computation).
Traditional Database System: Cannot be used to process and store a significant amount of data (big data).
Hadoop: Works better when the data size is big; it can process and store a large amount of data efficiently and effectively.
Traditional Database System: A traditional RDBMS is used to manage only structured and semi-structured data; it cannot be used to control unstructured data.
Hadoop: Can process and store a variety of data, whether structured or unstructured.
44. History of Hadoop
1997: Doug Cutting developed Lucene, an open source, Java-based indexing and search software.
2001: Mike Cafarella focused on indexing the entire web.
Problems:
Schemaless (no tables and columns)
Durable (once written, data should never be lost)
Capable of handling component failure (CPU, memory, network)
Automatically re-balanced (disk space consumption)
2003: Google published the GFS paper; Nutch DFS was developed.
The problem of durability and fault tolerance was still not solved.
45. History of Hadoop…
Solution: Distributed processing; the file system was divided into 64 MB chunks, storing each
chunk on 3 different nodes (replication factor).
2004: Google published a paper: MapReduce – Simplified Data Processing on Large
Clusters.
Solved the problems of parallelization, distribution and fault tolerance.
2005: Cutting integrated MapReduce into Nutch.
2006: Cutting named it Hadoop, which included HDFS and MR.
2008: Cutting licensed it under the Apache Software Foundation.
Certain problems/enhancements in Hadoop created sub-projects like Hive, Pig,
HBase, ZooKeeper.
46. Hadoop
Hadoop runs code across a cluster of computers. This process
includes the following core tasks that Hadoop performs −
Data is divided into directories and files (uniform-sized blocks of 128 MB)
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checks for successful execution of the code
Performs the sort phase that takes place between the map and reduce stages.
Sends the sorted data to a certain computer.
Writing the debugging logs for each job.
47. Hadoop as a good choice for:
Indexing log files
Sorting vast amounts of data
Image analysis
Search engine optimization
Analytics
Hadoop as a poor choice for:
Calculating the value of pi to 1,000,000 digits
Calculating Fibonacci sequences
Small structured data
49. HDFS
HDFS is a distributed file system that is fault tolerant,
scalable and extremely easy to expand.
HDFS is the primary distributed storage for Hadoop
applications.
HDFS provides interfaces for applications to move
themselves closer to data.
HDFS is designed to process large data sets with
write-once-read-many semantics
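To make the write-once-read-many semantics concrete, below is a minimal sketch using Hadoop's Java FileSystem API. The path /tmp/demo.txt, the file content and the class name are hypothetical, and the sketch assumes an HDFS reachable through the core-site.xml on the classpath.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteOnceReadMany {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/tmp/demo.txt");      // hypothetical path

    // Write once: create the file and write its content.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Read many: any number of clients can now open and read the same file.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
    fs.close();
  }
}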
51. Namenode
It is the master daemon that maintains and manages the DataNodes (slave nodes)
It records metadata of all blocks stored in the cluster, e.g. location of blocks
stored, size of the files, permissions, hierarchy, etc.
It records each and every change that takes place to file system metadata
If a file is deleted in HDFS, NameNode will immediately record this in EditLog
It regularly receives a Heartbeat and a block report from all the DataNodes
in the cluster to ensure that the DataNodes are alive
It keeps record of all blocks in HDFS and DataNode in which they are stored
52. DataNode
It is the slave daemon which runs on each slave machine
The actual data is stored on DataNodes
It is responsible for serving read and write requests from the clients
It is also responsible for creating blocks, deleting blocks and
replicating the same based on the decisions taken by the
NameNode
It sends heartbeats to the NameNode periodically to report the
overall health of HDFS, by default, this frequency is set to 3 seconds
53. Secondary Namenode
Works as a helper node to primary NameNode
Downloads the Fsimage file and edit logs file from
NameNode
Reads from RAM of NameNode and stores it to hard
disks periodically.
If the NameNode fails, the last saved FsImage on the secondary
NameNode can be used to recover the file system
metadata
55. List: Files
hdfs dfs -ls / – Lists all the files/directories for the given HDFS destination path.
hdfs dfs -ls -d /inputnew2 – Directories are listed as plain files. In this case, the command lists the details of the inputnew2 folder.
hdfs dfs -ls -R /user – Recursively lists all files in the /user directory and all its subdirectories.
hdfs dfs -ls /inputnew* – Lists all the files matching the pattern. In this case, it lists all files in the present directory whose names start with 'inputnew'.
56. Read / Write Files
hdfs dfs -text /inputnew/inputFile.txt – Takes a source file and outputs the file in text format on the terminal.
hdfs dfs -cat /inputnew/inputFile.txt – Displays the content of the HDFS file inputFile.txt on your stdout.
hdfs dfs -appendToFile /home/ubuntu/test1 /hadoop/text2 – Appends the content of the local file test1 to the HDFS file text2.
57. Upload / Download Files
hdfs dfs -put /home/ubuntu/sample /hadoop – Copies the file from the local file system to HDFS.
hdfs dfs -put -f /home/ubuntu/sample /hadoop – Copies the file from the local file system to HDFS; if the file already exists in the given destination path, the -f option makes put overwrite it.
hdfs dfs -get /newfile /home/ubuntu/ – Copies the file from HDFS to the local file system.
hdfs dfs -copyFromLocal /home/ubuntu/sample /hadoop – Works similarly to the put command, except that the source is restricted to a local file reference.
hdfs dfs -copyToLocal /newfile /home/ubuntu/ – Works similarly to the get command, except that the destination is restricted to a local file reference.
58. File management
hdfs dfs -cp /hadoop/file1 /hadoop1 – Copies a file from source to destination on HDFS. In this case, copies file1 from the hadoop directory to the hadoop1 directory.
hdfs dfs -cp -f /hadoop/file1 /hadoop1 – Copies a file from source to destination on HDFS. Passing -f overwrites the destination if it already exists.
hdfs dfs -mv /hadoop/file1 /hadoop1 – Moves files that match the specified file pattern <src> to a destination <dst>. When moving multiple files, the destination must be a directory.
hdfs dfs -rm /hadoop/file1 – Deletes the file (sends it to the trash).
hdfs dfs -rmdir /hadoop1 – Deletes a directory.
hdfs dfs -mkdir /hadoop2 – Creates a directory in the specified HDFS location.
59. Permissions
hdfs dfs -touchz /hadoop3 – Creates a file of zero length at <path>, with the current time as the timestamp of that <path>.
hdfs dfs -chmod 755 /hadoop/file1 – Changes the permissions of the file.
hdfs dfs -chown ubuntu:ubuntu /hadoop – Changes the owner of the file. The first ubuntu in the command is the owner and the second one is the group.
60. File System
hdfs dfs -df /hadoop – Shows the capacity, free and used space of the filesystem.
hdfs dfs -df -h /hadoop – Shows the capacity, free and used space of the filesystem; the -h parameter formats the sizes in a human-readable fashion.
hdfs dfs -du /hadoop/file – Shows the amount of space, in bytes, used by the files that match the specified file pattern.
hdfs dfs -du -s /hadoop/file – Rather than showing the size of each individual file that matches the pattern, shows the total (summary) size.
61. Administration
hadoop version – Checks the version of Hadoop.
hdfs fsck / – Checks the health of the Hadoop file system.
hdfs dfsadmin -safemode leave – Turns off the safemode of the NameNode.
hdfs dfsadmin -refreshNodes – Re-reads the hosts and exclude files to update the set of DataNodes that are allowed to connect to the NameNode and those that should be decommissioned or recommissioned.
hdfs namenode -format – Formats the NameNode.
63. HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Data processing using programming
Spark: In-memory Data Processing
PIG, HIVE: Data Processing Services using Query (SQL-like)
HBase: NoSQL Database
Mahout, Spark MLlib: Machine Learning
Apache Drill: SQL on Hadoop
ZooKeeper: Managing Cluster
Oozie: Job Scheduling
Flume, Sqoop: Data Ingesting Services
Solr & Lucene: Searching & Indexing
Ambari: Provision, Monitor and Maintain cluster
64. MapReduce Framework
Need for parallel distribution of tasks
Automatic expansion and contraction of processes
Enables continuation of processes w/o being affected by
network failures or system failures
MapReduce: Map and Reduce
Map and Reduce do not modify the original data; instead, they create new data
structures to hold their output
65. Features of MapReduce
Scheduling
Synchronization
Data locality
Handling of errors/faults
Scale out architecture
66. MapReduce Framework
Very large scale data: peta, exa bytes
Write once and read many data: allows for parallelism without mutexes
Map and Reduce are the main operations: simple code
There are other supporting operations such as combine and partition.
All map tasks should be completed before the reduce operation starts.
Map and reduce operations are typically performed by the same physical processor.
Number of map tasks and reduce tasks are configurable.
Operations are provisioned near the data.
Commodity hardware and storage.
Special distributed file system. Example: Hadoop Distributed File System (HDFS).
67. Working of MapReduce
MapReduce programming model works on an algorithm to execute the map and
reduce operations. Algorithm steps as follows:
Take a large dataset or set of records
Perform iteration over the data
Extract interesting patterns to prepare an output list using the map function
Arrange the output list to enable optimization for further processing
Compute a set of results by using reduce function
Provide the final output
The MapReduce model executes a given task by dividing it into two functions: map and reduce.
The map function is executed first, in parallel on different machines. The reduce function
takes the output of the map function to present the final output in an aggregate form.
The classic worked example, word count, is sketched below.
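Below is a minimal sketch of the word-count job in Java, closely following the standard Hadoop WordCount example (the org.apache.hadoop.mapreduce API); input and output paths are taken from the command line, and the reducer doubles as a combiner for local pre-aggregation.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word after the shuffle/sort phase.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation (combine)
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}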
68. MapReduce framework
JobTracker receives the jobs from client applications to process
large information.
These jobs are assigned in the forms of individual tasks (after a
job is divided into smaller parts) to various TaskTrackers.
The task distribution is transmitted to the reduce function so that
the final, integrated output, which is an aggregate of the data
processed by the map function, can be provided.
A cluster uses commodity servers as nodes. The data
processing job is accomplished through MapReduce and HDFS
69. Input is provided from large data files in the form of key-value pair,
which is the standard input format in a Hadoop MapReduce
programming model.
The input data is divided into small pieces, and master and slave
nodes are created. The master node usually executes on the
machine where the data is present, and slaves are made to work
remotely on the data
The map operation is performed simultaneously on all the data
pieces, which are read by the map function. The map function
extracts the relevant data and generates a key-value pair for it.
70. MapReduce Framework
Client: This initializes the job
JobTracker: is the master daemon for both Job resource management and
scheduling/monitoring of jobs
TaskTracker: is a slave node daemon in the cluster that accepts tasks (Map, Reduce and
Shuffle operations) from a JobTracker
Map tasks deal with the splitting and mapping of data, while Reduce tasks
shuffle and reduce the data
71. Hadoop Modes
Local (standalone) mode:
•No daemons
•Executes all parts of Hadoop MapReduce within a single Java process and uses the local filesystem as the storage
•No DFS
•Useful for testing/debugging MapReduce applications locally
Pseudo-distributed mode:
•We can run Hadoop on a single machine, emulating a distributed cluster
•Runs the different services of Hadoop as different Java processes, but within a single machine
•Each Hadoop daemon runs as a separate Java process
Fully distributed mode:
•Supports clusters that span from a few nodes to thousands of nodes
•Full production runs
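As a hedged illustration of the pseudo-distributed setup, the two standard Hadoop configuration files are typically edited as sketched below; hdfs://localhost:9000 is a conventional example address, and a replication factor of 1 suits a single machine.

core-site.xml:
<configuration>
  <property>
    <!-- point the default filesystem at the local HDFS daemon -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <!-- one machine, so keep a single replica per block -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>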
72. Hadoop MapReduce: A Closer Look
[Diagram: on each node, files loaded from the local HDFS store pass through an InputFormat, which divides them into Splits; RecordReaders (RR) turn each split into input (K, V) pairs for the Map tasks; a Partitioner assigns the intermediate (K, V) pairs to reducers; during the Shuffling process the intermediate (K, V) pairs are exchanged by all nodes; after a Sort, the Reduce tasks produce the final (K, V) pairs, which an OutputFormat writes back to the local HDFS store.]
74. Resource Manager (RM)
Responsibilities: Resource management and
assignment of all the apps
Master daemon of YARN
Requests received by the RM are forwarded to the
corresponding node manager.
RM does the allocation of available resources.
The RM is the highest authority for the allocation of resources.
75. Node Manager (NM)
More generic, flexible and efficient than the TaskTracker.
Dynamically created resource containers.
Container refers to a collection of resources such as
memory, CPU, disk and network IO.
NM is the slave daemon of Yarn.
NM is the per-machine/per-node framework agent,
responsible for containers, monitoring their resource usage
and reporting the same to the ResourceManager.
76. Container
Container in YARN is where a unit of work happens in the
form of task.
A job/application is split in tasks and each task gets
executed in one container having a specific amount of
allocated resources.
A container can be understood as a logical reservation of
resources (memory and vCores) that will be utilized by the task
running in that container
77. Application Master
Responsible for managing a set of submitted tasks or applications.
It first verifies and validates the submitted application's specifications and
rejects the application if there are not enough resources available.
It also ensures no other application exists with the same ID which is
already submitted.
Finally, it also observes the states of applications and manages
finished applications to save some Resource Manager’s memory.
78. YARN: Yet Another Resource Negotiator
YARN: Resource management + Job scheduling
Type of processing – MapReduce: batch processing with a single engine; YARN: real-time, batch, and interactive processing with multiple engines
Cluster resource optimization – MapReduce: average, due to fixed Map and Reduce slots; YARN: excellent, due to central resource management
Suitable for – MapReduce: only MapReduce applications; YARN: MapReduce and non-MapReduce applications
Managing cluster resources – MapReduce: done by the JobTracker; YARN: done by YARN
Namespace – MapReduce: supports only one namespace, i.e., HDFS; YARN: Hadoop supports multiple namespaces
79. Working of YARN
1. A client program submits the application.
2. The ResourceManager allocates a specified container to start the ApplicationMaster (AM).
3. The ApplicationMaster, on boot-up, registers with the RM.
4. The AM negotiates with the RM for appropriate resource containers.
5. On successful container allocations, the AM contacts the NM to launch the containers.
6. The application code is executed within the container, and the execution status is reported back to the AM.
7. During execution, the client communicates directly with the AM or RM to get status, progress updates, etc.
8. Once the application is complete, the AM unregisters with the RM and shuts down, allowing its own container to be reclaimed.
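As a small illustration of step 7, the sketch below uses YARN's Java client API to ask the ResourceManager for the status of known applications. It assumes a yarn-site.xml on the classpath; the class name is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration(); // reads yarn-site.xml
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Ask the ResourceManager for all known applications and their states.
    for (ApplicationReport report : yarnClient.getApplications()) {
      System.out.printf("%s  %s  %s%n",
          report.getApplicationId(),
          report.getName(),
          report.getYarnApplicationState());
    }
    yarnClient.stop();
  }
}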
80. Introduction: Apache Pig
Pig is a high level scripting language for operating on large datasets inside
Hadoop.
Compiles scripting language into MapReduce operations.
provides a simple language called Pig Latin, for queries and data
manipulation
Pig's multi-query approach reduces the number of times data is scanned.
Pig was developed as an ad-hoc way of creating and executing MapReduce
jobs on very large data sets.
Pig provides data operations like filters, joins, ordering, etc. and nested data
types like tuples and maps, that are missing from MapReduce.
81. When to use and when not to use Pig
Use Pig:
When data loads are time sensitive.
When processing various data sources.
When analytical insights are required through sampling.
Do not use Pig:
In places where the data is completely unstructured, like video, audio and readable text.
In places where time constraints exist, as Pig is slower than MapReduce jobs.
In places where more power is required to optimize the codes.
82. Applications of Pig
Processing of web logs.
Data processing for search platforms.
Supports ad-hoc queries across large data-sets.
Pig scripting is used for exploring large datasets.
In the prototyping of processing algorithms for large data-sets.
Required to process time-sensitive data loads.
Collecting large amounts of data in the form of search logs and web crawls.
Used where analytical insights are needed through sampling.
83. Apache Pig vs MapReduce
Apache Pig: It is a scripting language. MapReduce: It is a compiled programming language.
Apache Pig: Abstraction is at a higher level. MapReduce: Abstraction is at a lower level.
Apache Pig: It has fewer lines of code compared to MapReduce. MapReduce: Lines of code are more.
Apache Pig: Less development effort is needed. MapReduce: More development effort is required.
Apache Pig: Code efficiency is less compared to MapReduce. MapReduce: Code efficiency is higher compared to Pig.
84. Features of Pig
Rich set of operators
It provides many operators to perform operations like join, sort, filter, etc.
Ease of programming:
Pig Latin is similar to SQL and it is easy to write a Pig script if you are good at SQL.
Optimization opportunities:
The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus
only on semantics of the language.
Extensibility:
Using the existing operators, users can develop their own functions to read, process, and write data.
UDFs (a sketch follows this list):
Pig provides the facility to create User-Defined Functions in other programming languages
such as Java, and to invoke or embed them in Pig scripts.
Handles all kinds of data:
Apache Pig analyzes all kinds of data, both structured and unstructured.
It stores the results in HDFS.
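As a hedged sketch of the UDF facility mentioned above, the hypothetical Java class below extends Pig's EvalFunc to upper-case a single chararray argument. After packaging it into a jar, it would be registered with REGISTER and called from FOREACH ... GENERATE.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: upper-cases its single chararray argument.
public class UpperCase extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    // Guard against empty input and null fields.
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return input.get(0).toString().toUpperCase();
  }
}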
85. Apache Pig: Introduction
The language used to analyze data in Hadoop using Pig is known
as Pig Latin.
To perform a particular task, programmers using Pig
need to write a Pig script using the Pig Latin language, and execute
it using any of the execution mechanisms (Grunt shell, UDFs,
embedded).
Internally, Apache Pig converts these scripts into a series of
MapReduce jobs, and thus, it makes the programmer’s job easy.
86. Parameter-wise comparison: MapReduce vs Pig
Paradigm – MapReduce: data processing paradigm; Pig: procedural dataflow language
Type of language – MapReduce: low level and rigid; Pig: a high-level language
Join operation – MapReduce: it is difficult to perform join operations between datasets; Pig: performing a join operation is simple
Skills needed – MapReduce: the developer needs to have a robust knowledge of Java; Pig: a good knowledge of SQL is needed
Code length – MapReduce: requires about 20 times more code to accomplish the same task; Pig: due to the multi-query approach, the code length is greatly reduced
Compilation – MapReduce: jobs have a prolonged compilation process; Pig: no need for compilation, as every Pig operator is converted internally into MapReduce jobs
Nested data types – MapReduce: not present; Pig: present
88. Running Pig
Pig Latin statements and Pig commands can be run using interactive
mode and batch mode.
Pig Latin commands can be run in three ways: via the Grunt interactive shell,
through a script file, and as embedded queries inside Java
programs (a sketch of the embedded approach follows this slide).
Pig has six execution modes
Local mode, Tez Local mode, Spark local mode, Mapreduce mode, Tez mode,
Spark mode
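A minimal sketch of the embedded approach, reusing the 'student' example from the later slides; it assumes Pig's Java libraries on the classpath and runs in local mode (ExecType.MAPREDUCE would target a cluster). The class name is hypothetical.

import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPig {
  public static void main(String[] args) throws Exception {
    // Local mode; pass ExecType.MAPREDUCE to run against a Hadoop cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Register the same Pig Latin statements a script file would contain.
    pig.registerQuery("A = LOAD 'student' USING PigStorage() "
        + "AS (name:chararray, age:int, gpa:float);");
    pig.registerQuery("B = FOREACH A GENERATE name;");

    // openIterator triggers execution and streams back the tuples of B.
    Iterator<Tuple> it = pig.openIterator("B");
    while (it.hasNext()) {
      System.out.println(it.next());
    }
    pig.shutdown();
  }
}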
89. Local Mode – This mode can be run on a single machine; all files are installed and
run using the local host and file system. Uses the -x flag (pig -x local).
Tez Local Mode – This mode is similar to local mode, except internally Pig will
invoke the Tez runtime engine. Uses the -x flag (pig -x tez_local).
Spark Local Mode – This mode is similar to local mode, except internally Pig will
invoke the Spark runtime engine. Uses the -x flag (pig -x spark_local).
Mapreduce Mode – To run Pig in mapreduce mode, you need access to a Hadoop cluster
and an HDFS installation. Mapreduce mode is the default mode. Uses the -x flag (pig, or
pig -x mapreduce).
Tez Mode – To run Pig in Tez mode, you need access to a Hadoop cluster and an HDFS
installation. Specify Tez mode using the -x flag (-x tez).
Spark Mode – To run Pig in Spark mode, you need access to a Spark, YARN or Mesos
cluster and an HDFS installation. Specify Spark mode using the -x flag (-x spark).
90. Interactive mode
Local Mode : $ pig -x local
Tez Local Mode : $ pig -x tez_local
Spark Local Mode : $ pig -x spark_local
Mapreduce Mode : $ pig -x mapreduce
Tez Mode : $ pig -x tez
Spark Mode : $ pig -x spark
91. Batch mode
Local Mode : $ pig -x local id.pig
Tez Local Mode : $ pig -x tez_local id.pig
Spark Local Mode : $ pig -x spark_local id.pig
Mapreduce Mode : $ pig id.pig
Tez Mode : $ pig -x tez id.pig
Spark Mode : $ pig -x spark id.pig
92. Pig Latin statements
Pig Latin statements are basic constructs used to process data
using Pig.
A Pig Latin statement is an operator that takes a relation as input
and produces another relation as output.
Pig Latin statements may include expressions and schemas.
Pig Latin statements can span multiple lines and must end with a
semi-colon ( ; ).
By default, Pig Latin statements are processed using multi-query
execution.
93. Pig Latin statements are generally organized as follows:
A LOAD statement to read data from the file system.
A series of "transformation" statements to process the data.
A DUMP statement to view results or a STORE statement to save the results.
94. In the following example, Pig will validate, but not execute, the LOAD and FOREACH
statements.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
In the following example, Pig will validate and then execute the LOAD, FOREACH, and
DUMP statements.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(Ram)
(Sam)
(Ham)
(Pam)