Distributed Big Data
Processing with
Hadoop Platform
Girish Khanzode
Contents
What is Hadoop?
• An Apache top level project, open-source implementation of frameworks
for reliable, scalable, distributed computing and data storage.
• A flexible, highly-available architecture for large scale computation and
data processing on a network of commodity hardware.
• Implementation of Google’s MapReduce, using HDFS
Hadoop Goals
• Facilitate the storage and processing of large and/or rapidly growing data
sets, primarily non-structured in nature
• Simple programming models
• High scalability and availability
• Fault-tolerance
• Move computation rather than data
• Use commodity (cheap!) hardware with little redundancy
• Provide cluster based computing
Map Reduce Patent
Google granted US Patent 7,650,331, January 2010 - System and method for efficient large-scale data processing
________________________________________________________________________________________________
A large-scale data processing system and method includes one or more application-independent map modules configured
to read input data and to apply at least one application-specific map operation to the input data to produce
intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the
parallel processing environment. A plurality of intermediate data structures are used to store the intermediate data
values. One or more application-independent reduce modules are configured to retrieve the intermediate data values
and to apply at least one application-specific reduce operation to the intermediate data values to provide output data.
Platform Assumptions
• Hardware will fail
• Processing will be run in batches. Thus there is an emphasis on high throughput as opposed to low
latency
• Applications that run on HDFS have large data sets
• A typical file in HDFS is gigabytes / terabytes in size
• It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster
• It should support tens of millions of files in a single instance
• Applications need a write-once-read-many access model
• Moving Computation is cheaper than moving data
• Portability is important
Components
• Map Reduce Layer
– JobTracker (master) - coordinates the execution of jobs
– TaskTrackers (slaves) - control the execution of map and reduce tasks on the
machines that do the processing
• HDFS Layer
– Stores files
– NameNode (master) - manages the file system, keeps metadata for all the files
and directories in the tree
– DataNodes (slaves) - workhorses of the file system; store and retrieve blocks
when they are told to (by clients or the NameNode) and report back to the
NameNode periodically
Architecture – Multi-Node Cluster
HDFS
• Hadoop Distributed File System
• Designed to run on commodity hardware
• Part of Apache Hadoop Core project http://hadoop.apache.org/core/
• Highly fault-tolerant
• Designed for deployment on low-cost hardware
• Provides high throughput access to application data and is suitable for
applications that have large data sets.
– Write-once-read-many access model
• Relaxes a few POSIX requirements to enable streaming access to file system data
HDFS Architecture
HDFS - Key Points
• Files are broken into large blocks.
– Typically 128 MB block size
– Blocks are replicated on multiple DataNode for reliability
• Understands rack locality
• One replica on the local node, another replica on a remote rack, a third replica on the local rack;
additional replicas are placed randomly
• Data placement exposed so that computation can be migrated to data
• Client talks to both the NameNode and DataNodes
• Data is not sent through the NameNode; clients access data directly from DataNodes (see the client-side sketch below)
• Throughput of file system scales nearly linearly with the number of nodes.
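A minimal client-side sketch of this interaction using the Java FileSystem API (the path and cluster address are illustrative, not from the deck): open() asks the NameNode where the blocks live, and the returned stream then streams block data straight from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS is expected to point at the NameNode, e.g. hdfs://namenode:8020 (assumed)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode for block locations; the returned stream
        // then reads the block data directly from the DataNodes
        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {  // hypothetical path
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}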
NameNode
• DFS Master
– Manages the file system namespace
– Controls read/write access to files
– Manages block replication
– Checkpoints namespace and journals namespace changes for reliability
• Metadata of Name node in Memory
– The entire metadata is in main memory
– No demand paging of FS metadata
• Types of Metadata:
– List of files, file and chunk namespaces; list of blocks, location of replicas; file attributes etc.
DataNodes
• Serve read/write requests from clients
• Perform replication tasks upon instruction from the NameNode
• Store data in the local file system
• Store metadata of each block (e.g. CRC)
• Serve data and metadata to clients
• Periodically send a report of all existing blocks to the NameNode
• Periodically send heartbeats to the NameNode (to help detect node failures)
• Facilitate pipelining of data (to other specified DataNodes)
DataNodes
HDFS High Availability
• Option of running two redundant NameNodes in the
same cluster
• Active/Passive configuration with a hot standby
• Fast fail-over to a new NameNode if a machine crashes
• Graceful administrator-initiated fail-over for planned
maintenance
HDFS High Availability
NameNode Failure
• Prior to Hadoop 2.0, the NameNode
was a single point of failure (SPOF) in
an HDFS cluster
• Secondary NameNode
– Not a standby for the NameNode
– Connects to the NameNode every hour
– Performs housekeeping and backup of
NameNode metadata
– Saved metadata can be used to rebuild a failed
NameNode
DataNode Failure
• Each DataNode periodically sends a Heartbeat message to
the NameNode
• If the NameNode does not receive a heartbeat from a
particular DataNode for 10 minutes, then it considers that
data node to be dead/out of service.
• NameNode initiates replication of blocks hosted on that data
node to some other data node
DataNode Failure
MapReduce Framework
• Programming model developed at Google
• Sort/merge based distributed computing
• Automatic parallel execution & distribution
• Fault tolerant
• Functional style programming for parallelism across a large cluster of
nodes
• Works like a parallel Unix pipeline:
– cat input | grep | sort | uniq -c | cat > output
– Input | Map | Shuffle & Sort | Reduce | Output
MapReduce Framework
• Underlying system takes care of
– partitioning of the input data
– scheduling the program’s execution across several machines
– handling machine failures
– managing inter-machine communication
• Provides inter-node communication
– Failure recovery, consistency etc
– Load balancing, scalability etc
• Suitable for batch processing applications
– Log processing
– Web index building
MapReduce Framework
What is MapReduce Used For?
• At Google:
– Index building for Google Search
– Article clustering for Google News
– Statistical machine translation
• At Yahoo!:
– Index building for Yahoo! Search
– Spam detection for Yahoo! Mail
• At Facebook:
– Data mining
– Ad optimization
– Spam detection
MapReduce Components
• JobTracker
– Map/Reduce Master
– Accepts MR jobs submitted by users
– Assigns Map and Reduce tasks to TaskTrackers
– Monitors task and TaskTracker statuses, Re-executes tasks upon failure
• TaskTrackers
– Map/Reduce Slaves
– Run Map and Reduce tasks upon instruction from the JobTracker
– Manage storage and transmission of intermediate output
MapReduce Components
Distributed Execution
[Figure: the user program forks a master and several workers; the master assigns map tasks over the input splits (Split 0-2) and assigns reduce tasks; map workers read their splits and write intermediate data to local disk; reduce workers perform remote reads and sort, then write the output files (Output File 0, Output File 1)]
Working of MapReduce
• The run time partitions the input and provides it to different Map
instances
• Map (k1, v1) -> (k2, v2)
• The run time collects the (k2, v2) pairs and distributes them to several
Reduce functions so that each Reduce function gets the pairs with the
same k2
• Each Reduce produces a single (or zero) file output
• Map and Reduce are user written functions
Input and Output
• MapReduce operates exclusively on <key, value> pairs
• Job Input: <key, value> pairs
• Job Output: <key, value> pairs
• Key and value can be different types, but must be serializable
by the framework.
Input <k1, v1> -> map -> <k2, v2> -> reduce -> Output <k3, v3>
Example - Counting Words
• Given a large collection of documents, output the
frequency for each unique word
• After putting this data into HDFS, Hadoop
automatically splits into blocks and replicates each
block
Input Reader
• Input reader reads a block and divides into splits
• Each split would be sent to a map function
– a line is an input of a map function
• The key could be some internal number (filename-blockid-lineid), the value is the
content of the textual line.
Block 1:
Apple Orange Mongo
Orange Grapes Plum
Block 2:
Apple Plum Mongo
Apple Apple Plum
[Figure: the input reader turns each line of the two blocks into the input of one map function]
Mapper - Map Function
• Mapper takes the output generated by the input reader and outputs a list of intermediate <key, value> pairs.
Mapper
m1: Apple Orange Mongo -> (Apple, 1), (Orange, 1), (Mongo, 1)
m2: Orange Grapes Plum -> (Orange, 1), (Grapes, 1), (Plum, 1)
m3: Apple Plum Mongo -> (Apple, 1), (Plum, 1), (Mongo, 1)
m4: Apple Apple Plum -> (Apple, 1), (Apple, 1), (Plum, 1)
Reducer - Reduce Function
• Reducer takes the output generated by the Mapper,
aggregates the value for each key, and outputs the final
result
• There is shuffle/sort before reducing.
Reducer - Reduce Function
The mapper output (from m1-m4) is shuffled and sorted so that all pairs with the same key reach the same reducer:
r1: (Apple, 1) x 4 -> (Apple, 4)
r2: (Orange, 1) x 2 -> (Orange, 2)
r3: (Grapes, 1) x 1 -> (Grapes, 1)
r4: (Mongo, 1) x 2 -> (Mongo, 2)
r5: (Plum, 1) x 3 -> (Plum, 3)
Reducer - Reduce Function
• The same key MUST go to the same reducer
• Different keys CAN go to the same reducer.
All (Orange, 1) pairs go to the same reducer r2, which outputs (Orange, 2).
Different keys can share a reducer: the (Orange, 1) pairs and (Grapes, 1) can both go to r2, which then outputs (Orange, 2) and (Grapes, 1).
Combiner
• When the map operation outputs its pairs they are already available in memory
• For efficiency reasons, sometimes it makes sense to take advantage of this fact by supplying
a combiner class to perform a reduce-type function
• If a combiner is used then the map key-value pairs are not immediately written to the output
• Instead they will be collected in lists, one list per each key value. (optional)
(Apple, 1), (Apple, 1), (Plum, 1) -> combiner -> (Apple, 2), (Plum, 1)
Partitioner - Partition function
• When a mapper emits a key value pair, it has to be sent to one of the
reducers - Which one?
• The mechanism sending specific key-value pairs to specific reducers is
called partitioning (the key-value pairs space is partitioned among the
reducers)
• In Hadoop, the default partitioner is Hash-Partitioner, which hashes a
record’s key to determine which partition (and thus which reducer) the
record belongs in
• The number of partitions is equal to the number of reduce tasks for
the job
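A hedged sketch of a custom partitioner in the Java MapReduce API; this one simply mirrors the default hash-mod-R behaviour, and the class name is an illustration rather than anything from the deck.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Same idea as the default Hash-Partitioner: hash the key and take it
        // modulo the number of reduce tasks; the mask keeps the value non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered on the job with job.setPartitionerClass(WordPartitioner.class); a skew-aware variant would replace the hash with domain-specific logic.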
Importance of Partition
• It has a direct impact on overall performance of the job
• A poorly designed partitioning function will not distribute the load evenly
over the reducers, losing much of the benefit of the distributed
map/reduce infrastructure
• It may sometimes be necessary to control how the key/value pairs are
partitioned over the reducers
Importance of Partition
• Example - a job's input is a huge set of tokens with their numbers of occurrences,
and you want to sort them by number of occurrences
[Figure: the resulting reducer load without any customized partitioner vs. with a customized partitioner]
Example - Word Count
• map(String key, String value):
// key: document name; value: document contents; map (k1,v1) -> list(k2,v2)
for each word w in value: EmitIntermediate(w, "1");
(If the input string is ("abc def ghi abc mno pqr"), Map produces {<"abc", 1>, <"def", 1>, <"ghi", 1>, <"abc", 1>,
<"mno", 1>, <"pqr", 1>})
• reduce(String key, Iterator values):
// key: a word; values: a list of counts; reduce (k2,list(v2)) -> list(v2)
int result = 0;
for each v in values:
result += ParseInt(v);
Emit(AsString(result));
(Example: reduce(“abc”, <1,1>) -> 2)
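The same word count expressed against the Java org.apache.hadoop.mapreduce API - a minimal sketch, with class names chosen for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(k1, v1) -> list(k2, v2): the input key is the byte offset of the line,
// the value is the line itself; emit (word, 1) for every word
public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// reduce(k2, list(v2)) -> (k2, sum): add up the counts for each word
class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}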
Physical Flow
Simplified MapReduce
[Figure: on each machine (Machine 1, Machine 2), input pairs <k1, v1> ... <k6, v6> pass through a local Map producing <nk, nv> pairs, a local Sort that groups them by key (nk1, nk2, nk3), and a local Reduce that emits counts such as <nk2, 3>, <nk1, 2>, <nk3, 1>]
JobTracker Failure
• If the master task dies, a new copy can be started from the
last check-pointed state. However, in most cases, the user
restarts the job
• After restarting JobTracker all the jobs running at the time of
the failure should be resubmitted
Handling TaskTracker Failure
• The JobTracker pings every worker periodically
• If no response is received from a worker in a certain amount of time, the
master marks the worker as failed
• Any map tasks completed by the worker are reset back to their initial idle
state, and therefore become eligible for scheduling on other workers.
• Any map task or reduce task in progress on a failed worker is also reset to
idle and becomes eligible for rescheduling.
• Task tracker will stop sending the heartbeat to the Job Tracker
Handling TaskTracker Failure
• The JobTracker notices this failure
– It has not received a heartbeat for 10 minutes
– This interval can be configured via the mapred.tasktracker.expiry.interval property
• The JobTracker removes this TaskTracker from the task pool
• Completed map tasks are re-run as well, because their intermediate output
resides on the failed TaskTracker's local file system, which is not accessible
to the reduce tasks
Handling TaskTracker Failure
Data flow
• Input, final output are stored on a distributed file system
– Scheduler tries to schedule map tasks “close” to physical storage
location of input data
• Intermediate results are stored on local FS of map and reduce
workers
• Output is often input to another map reduce task
Coordination
• Master data structures
– Task status: (idle, in-progress, completed)
– Idle tasks get scheduled as workers become available
– When a map task completes, it sends the master the location and
sizes of its R intermediate files, one for each reducer
– Master pushes this info to reducers
• Master pings workers periodically to detect failures
Failures
• Map worker failure
– Map tasks completed or in-progress at worker are reset to idle
– Reduce workers are notified when task is rescheduled on another worker
• Reduce worker failure
– Only in-progress tasks are reset to idle
• Master failure
– MapReduce task is aborted and client is notified
How many Map and Reduce jobs?
• M - map tasks, R - reduce tasks
• Rule of thumb
– Make M and R much larger than the number of nodes in cluster
– One DFS chunk per map is common
– Improves dynamic load balancing and speeds recovery from worker failure
• Usually R is smaller than M because output is spread across R files
Mapping Workers to Processors
• MapReduce master takes the location information of the input files and
schedules a map task on a machine that contains a replica of the
corresponding input data
• Failing that, it attempts to schedule a map task near a replica of that task's
input data
• When running large MapReduce operations on a significant fraction of
the workers in a cluster, most input data is read locally and consumes no
network bandwidth
Combiner Function
• User can specify a Combiner function that does partial merging of the intermediate local
disk data before it is sent over the network.
• The Combiner function is executed on each machine that performs a map task
• Typically the same code is used to implement both the combiner and the reduce functions
• Often a map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
– popular words in Word Count
• Can save network time by pre-aggregating at mapper
– combine(k1, list(v1)) -> v2
– Usually same as reduce function
• Works only if reduce function is commutative and associative
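A hedged sketch of a driver that wires these pieces together, reusing the reducer as the combiner (safe here because summing counts is commutative and associative); the mapper, reducer and partitioner class names refer to the illustrative sketches earlier in this deck, not to anything shipped with Hadoop.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // partial sums on the map side
        job.setReducerClass(IntSumReducer.class);
        // job.setPartitionerClass(WordPartitioner.class); // optional custom partitioner
        job.setNumReduceTasks(4);                    // R, the number of output files

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}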
Partition Function
• The users of MapReduce specify the number of reduce tasks/output files that they
desire (R)
• Data gets partitioned across these tasks using a partitioning function on the
intermediate key
• A default partitioning function is provided that uses hashing (hash(key) mod R)
• In some cases, it may be useful to partition data by some other function of the
key. The user of the MapReduce library can provide a special partitioning
function.
Task Granularity
• The map phase has M pieces and the reduce phase has R pieces
• M and R should be much larger than the number of worker machines
• Having each worker perform many different tasks improves dynamic load
balancing and also speeds up recovery when a worker fails
• The larger M and R are, the more decisions the master must make
• R is often constrained by users because the output of each reduce task ends up in
a separate output file
• Typically - at Google, M = 200,000 and R = 5,000, using 2,000 worker machines
Job Execution View
Data Flow
Execution Summary
• Distributed Processing
– Partition input key/value pairs into chunks, run map() tasks in parallel
– After all map()s are complete, consolidate all emitted values for each
unique emitted key
– Now partition space of output map keys, and run reduce() in parallel
• If map() or reduce() fails -> re-execute
MapReduce – Data Flow
• Input reader – divides input into appropriate size splits which get assigned to a Map
function.
• Map function – maps file data/split to smaller, intermediate <key, value> pairs.
• Partition function – finds the correct reducer: given the key and number of reducers, returns
the desired reducer node. (optional)
• Compare function – input from the Map intermediate output is sorted according to the
compare function. (optional)
• Reduce function – takes intermediate values and reduces to a smaller solution handed back
to the framework.
• Output writer – writes file output.
Execution Overview
• The MapReduce library in user program splits input files into M pieces of typically 16 MB to 64 MB/piece
• It then starts up many copies of the program on a cluster of machines
• One of the copies of the program is the master
• The rest are workers that are assigned work by the master
• There are M map tasks and R reduce tasks to assign
• The master picks idle workers and assigns each one a map task or a reduce task
• A worker who is assigned a map task reads the contents of the assigned input split
• It parses key/value pairs out of the input data and passes each pair to the user-defined Map function
• The intermediate key/value pairs produced by the Map function are buffered in memory and
periodically written to local disk, partitioned into R regions
• The locations of these buffered pairs on the local disk are passed back to the master, who forwards these
locations to the reduce workers
Execution Overview
• When a reduce worker is notified by the master about these locations, it uses remote
procedure calls (RPC) to read the buffered data from the local disks of the map workers
• When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all
occurrences of the same key are grouped together
• The reduce worker iterates over the sorted intermediate data and for each unique intermediate
key encountered, it passes the key and the corresponding set of intermediate values to the user's
Reduce function
• The output of the Reduce function is appended to a final output file for this reduce partition
• When all map tasks and reduce tasks have been completed, the master wakes up the user
program - the MapReduce call in the user program returns back to the user code
• The output of the mapreduce execution is available in the R output files (one per reduce task)
MapReduce Advantages
• Distribution is completely transparent
– Not a single line of distributed programming (ease, correctness)
• Automatic fault-tolerance
– Determinism enables running failed tasks somewhere else again
– Saved intermediate data enables just re-running failed reducers
• Automatic scaling
– As operations are side-effect free, they can be distributed to any number of machines dynamically
• Automatic load-balancing
– Move tasks and speculatively execute duplicate copies of slow tasks (stragglers)
APACHE HIVE
Need for High-Level Languages
• Hadoop is great for large-data processing
– But writing Java programs for everything is verbose and slow
– Not everyone wants to (or can) write Java code
• Solution: develop higher-level data processing languages
– Hive - HQL is like SQL
– Pig - Pig Latin is a bit like Perl
• Hive - data warehousing application in Hadoop
– Query language is HQL, variant of SQL
– Tables stored on HDFS as flat files
Need for High-Level Languages
• Pig - large-scale data processing system
– Scripts are written in Pig Latin, a dataflow language
– Developed by Yahoo, now open source
• Common idea
– Provide higher-level language to facilitate large-data processing
– Higher-level language compiles down to Hadoop jobs
Hive - Background
• Started at Facebook
• Data was collected by nightly cron jobs into Oracle DB
• ETL via hand-coded python
• Grew from 10s of GBs (2006) to 1 TB/day new data (2007), now 10x that
• A data warehouse system to facilitate easy data summarization, ad-hoc
queries and the analysis of large datasets stored in Hadoop compatible
file systems
• Supports Hive Query Language (HQL) statements similar to SQL
statements
Source: cc-licensed slide by Cloudera
Hive
• HiveQL is a subset of SQL covering the most common statements
• HQL statements are broken down by the Hive service into MapReduce jobs and executed
across a Hadoop cluster
• JDBC/ODBC support
• Follows schema-on-read design – very fast initial load
• Agile data types: Array, Map, Struct, and JSON objects
• User Defined Functions and Aggregates
• Regular Expression support
• Partitions and Buckets (for performance optimization)
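Because Hive exposes a JDBC driver, HQL can be submitted from plain Java - a minimal sketch assuming a reachable HiveServer2 instance and the Hive JDBC driver on the classpath (host, credentials and the exact query are illustrative assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 typically listens on port 10000; host, database and user are assumed
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hadoopuser", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this HQL into one or more MapReduce jobs
             ResultSet rs = stmt.executeQuery(
                     "SELECT pageid, COUNT(1) FROM page_view GROUP BY pageid")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}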
Hive Components
• Shell: allows interactive queries
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (MR, HDFS, metadata)
• Metastore: schema, location in HDFS, SerDe
Source: cc-licensed slide by Cloudera
Hive Components
Architecture
Data Model
• Basic column types (int, float, boolean)
• Complex types: List / Map (associative array), Struct
CREATE TABLE complex (
col1 ARRAY<INT>,
col2 MAP<STRING, INT>,
col3 STRUCT<a:STRING, b:INT, c:DOUBLE>
);
• Built-in functions – mathematical, statistical, string, date, conditional
functions, aggregate functions and functions for working with XML and
JSON
Data Model
• Tables
– Typed columns (int, float, string, boolean)
– list: map (for JSON-like data)
• Partitions
– For example, range-partition tables by date
• Buckets
– Hash partitions within ranges
– useful for sampling, join optimization
Source: cc-licensed slide by Cloudera
Metastore
• Database: namespace containing a set of tables
• Holds table definitions (column types, physical layout)
• Holds partitioning information
• Can be stored in Derby, MySQL, and many other relational
databases
Source: cc-licensed slide by Cloudera
Physical Layout
• Warehouse directory in HDFS
– /user/hive/warehouse
• Tables stored in subdirectories of warehouse
– Partitions form subdirectories of tables
• Actual data stored in flat files
– Control char-delimited text or SequenceFiles
– With custom SerDe, can use arbitrary format
Source: cc-licensed slide by Cloudera
Metadata
• Data organized into tables
• Metadata like table schemas stored in the database metastore
• The metastore is the central repository of Hive metadata
• Metastore runs in the same process as the Hive service
• Loading data into a Hive table copies the data file into Hive's warehouse
directory; the input data is not parsed into rows at load time
• HiveQL queries use metadata for query execution
Tables
• Logically made up of the data being stored and the associated metadata
describing the layout of the data in the table.
• The data can reside in HDFS or a similar system, or in S3
• Hive stores the metadata in a relational database and not in HDFS
• When a table is created, Hive moves the data into its warehouse directory
• External table – Hive refers data outside the warehouse directory
Partitioning
• Hive organizes tables into partitions by dividing a table into coarse-grained parts based on
the value of a partition column, such as date
• Using partitions makes queries faster on slices of the data
• Log files with each record containing a timestamp - If partitioned by date, records for the
same date would be stored in the same partition
• Queries restricted to a particular date or set of dates are more efficient since only required
files are scanned
• Partitioning on multiple dimensions allowed.
• Defined at table creation time
• Separate subdirectory for each partition
Bucketing
• Partitions further organized in buckets for more efficient queries
• The CLUSTERED BY clause is used to create buckets on the specified column
• Data within a bucket can be additionally sorted by one or more columns
UDF
• A UDF operates on a single row and produces a single row as its output. Most
functions, such as mathematical functions and string functions, are of this
type
• A UDAF (user-defined aggregate function) works on multiple input rows
and creates a single output row, e.g. COUNT and MAX
• A UDTF (user-defined table-generating function) operates on a single
row and produces multiple rows - a table - as output
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
[Figure: page_view joined with user produces pv_users]
Hive QL – Join
[Figure: the join executed as MapReduce - page_view and user are mapped, shuffled and sorted on the join key, and reduced to produce pv_users]
Hive QL – Join in Map Reduce
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
[Figure: pv_users aggregated into pageid_age_sum]
Hive QL – Group By
[Figure: the GROUP BY executed as MapReduce - pv_users is mapped, shuffled and sorted on (pageid, age), and reduced into pageid_age_sum]
Hive QL – Group By in Map Reduce
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view GROUP BY pageid
[Figure: page_view reduced to the distinct-count result]
Hive QL – Group By with Distinct
[Figure: page_view is shuffled and sorted, then reduced; the shuffle key is a prefix of the sort key]
Hive QL – Group By with Distinct in Map Reduce
[Figure: ORDER BY - page_view is shuffled randomly, then sorted and reduced]
Hive QL - Order By
Hive Benefits
• An easy way to process large scale data
• Supports SQL-based queries
• Provides more user-defined interfaces to extend
programmability
• Efficient execution plans for performance
• Interoperability with other database tools
APACHE PIG
Apache Pig
• Framework to analyze large un-structured and semi-structured data on
top of Hadoop
• Consists of a high-level language for expressing data analysis programs,
coupled with infrastructure
• Compiles down to MapReduce jobs
• Infrastructure layer consists of
– a compiler to create sequences of Map-Reduce programs
– language layer consists of a textual language called Pig Latin
Apache Pig
Pig Latin
• A scripting language to explore large datasets
• Easy to achieve parallel execution of simple data analysis tasks
• Complex tasks with multiple interrelated data transformations explicitly
encoded
• Automatic optimization
• Create own functions for special-purpose processing
• A script can map to multiple map-reduce jobs
Benefits
• Faster development
– Fewer lines of code (writing Pig Latin is closer to writing SQL queries than to writing MapReduce in Java)
– Re-use the code (Pig library, PiggyBank)
• Conduct a test: find the top 5 words with the highest frequency
– Pig Latin needed 10 lines of code as against 200 lines in Java
– Pig execution time was 15 minutes as against 4 hours in Java
Language Features
• A Pig Latin program is made up of a series of transformations applied to the input data to produce output
• A dataflow language that provides the high-level language interface for Hadoop
• The Pig engine parses and compiles Pig Latin scripts into MapReduce jobs that run on top of Hadoop
• Keywords - LOAD, FILTER, FOREACH ... GENERATE, GROUP BY, STORE, JOIN, DISTINCT, ORDER BY
• Aggregations - Count, Avg, Sum, Max, Min
• Schema - Defined at query-time not when files are loaded
• UDFs
• Packages for common input/output formats
Language Features
• Virtually all parts of the processing path are customizable: loading, storing,
filtering, grouping, and joining can all be altered by UDFs
• Writing load and store functions is easy once an InputFormat and OutputFormat
exist
• Multi-query: pig combines certain types of operations together in a single pipeline
to reduce the number of times data is scanned.
• Order by provides total ordering across reducers in a balanced way
• Piggybank is a repository of UDF Pig functions shared by the Pig community
Data Types
• Scalar Types - Int, long, float, double, Boolean, null,
chararray, bytearray
• Complex Types
– Field - a piece of data
– Tuple - an ordered set of fields
– Bag - a collection of tuples
– Relation - a bag
Data Types
• Samples
– Tuple is a row in database - ( 0002576169, Tome, 20, 4.0)
• Bag
– a table or a view in database
– an unordered collection of tuples represented using curly braces
• {(0002576169 , Tome, 20, 4.0),
• (0002576170, Mike, 20, 3.6),
• (0002576171 Lucy, 19, 4.0), …. }
Running a Pig Latin Script
• Local mode
– Local host and local file system is used
– Neither Hadoop nor HDFS is required
– Useful for prototyping and debugging
– Suitable only for small datasets
• MapReduce mode
– Run on a Hadoop cluster and HDFS
Running a Pig Latin Script
• Batch mode - run a script directly
– pig -x local my_pig_script.pig
– pig -x mapreduce my_pig_script.pig
• Interactive mode uses the Pig shell Grunt to run scripts
– grunt> Lines = LOAD '/input/input.txt' AS (line:chararray);
– grunt> Unique = DISTINCT Lines;
– grunt> DUMP Unique;
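Pig can also be embedded in Java through the PigServer API - a minimal sketch of the same DISTINCT example (the class name and paths are illustrative):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class UniqueLines {
    public static void main(String[] args) throws Exception {
        // ExecType.LOCAL mirrors "pig -x local"; use ExecType.MAPREDUCE for a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("Lines = LOAD '/input/input.txt' AS (line:chararray);");
        pig.registerQuery("Unique = DISTINCT Lines;");
        // store() triggers execution, like DUMP/STORE in Grunt
        pig.store("Unique", "/output/unique");
        pig.shutdown();
    }
}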
Running a Pig Latin Script
Operations
• Loading data
– LOAD loads input data
– Lines = LOAD 'input/access.log' AS (line:chararray);
• Projection
– FOREACH … GENERATE … (similar to SELECT)
– takes a set of expressions and applies them to every record
Operations
• Grouping
– collects together records with the same key
• Dump/Store
– DUMP displays results on the screen - the trigger for Pig to start execution
– STORE saves results to the file system
• Aggregation
– AVG, COUNT, MAX, MIN, SUM
Foreach ... Generate
• Iterates over the members of a bag
• Example
– student_data = FOREACH students GENERATE studentid, name
• The result of statement is another bag
• Elements are named as in the input bag
Positional Reference
• Fields referred by positional notation or by name (alias)
– students = LOAD 'student.txt' USING PigStorage() AS
(name:chararray, age:int, gpa:float);
– DUMP students;
– (John,18,4.0F)
– (Mary,19,3.8F)
– (Bill,20,3.9F)
– studentname = FOREACH students GENERATE $0 AS studentname;
Positional Reference
Group
• Groups data in one relation
• GROUP and COGROUP operators are identical but COGROUP creates a
nested set of output tuples
• Both operators work with one or more relations
• For readability GROUP is used in statements involving one relation
• COGROUP is used in statements involving two or more relations
Group
grunt> DUMP A;
(John, Pasta)
(Kate, Burger)
(Joe, Orange)
(Eve, Apple)
Let’s group by the number of characters in the second field:
grunt> B = GROUP A BY SIZE($1);
grunt> DUMP B;
(5,{(John, Pasta),(Eve, Apple)})
(6,{(Kate, Burger),(Joe, Orange)})
Dump & Store
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'apple';
C = FILTER A BY $1 == 'apple';
STORE B INTO 'output/b';
STORE C INTO 'output/c';
Relations B and C are both derived from A
Previously this would create two MapReduce jobs
Pig now creates a single MapReduce job that produces both outputs
Count
• Computes the number of elements in a bag.
• Requires a preceding GROUP ALL statement for global
counts and GROUP BY statement for group counts.
• X = FOREACH B GENERATE COUNT(A);
Pig Operation - Order
• Sorts a relation based on one or more fields
• In Pig, relations are unordered
• If you order relation A to produce relation X, relations A and X still contain
the same elements
• student = ORDER students BY gpa DESC;
Example 1
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, id, time, query);
clean1 = FILTER raw BY id > 20 AND id < 100;
clean2 = FOREACH clean1 GENERATE
user, time,
org.apache.pig.tutorial.sanitize(query) AS query;
user_groups = GROUP clean2 BY (user, query);
user_query_counts = FOREACH user_groups
GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time);
STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(',');
Annotations: read from HDFS; the input format is tab delimited; the schema is declared at run time; rows are filtered on predicates; records are grouped; group aggregation is applied; output is stored as comma-delimited text
Example 2
A = load '$widerow' using PigStorage('\u0001') as (name: chararray, c0: int, c1: int, c2: int);
B = group A by name parallel 10;
C = foreach B generate group, SUM(A.c0) as c0, SUM(A.c1) as c1, AVG(A.c2) as c2;
D = filter C by c0 > 100 and c1 > 100 and c2 > 100;
store D into '$out';
Annotations: '$widerow' and '$out' are script arguments; the input is Ctrl-A ('\u0001') delimited; column types are defined in the schema; PARALLEL 10 requests 10 reducers
Example 3 – Repartition join
register pigperf.jar;
A = load 'page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent, query_term, timestamp, estimated_revenue);
B = foreach A generate user, (double) estimated_revenue;
alpha = load 'users' using PigStorage('\u0001') as (name, phone, address, city, state, zip);
beta = foreach alpha generate name, city;
C = join beta by name, B by user parallel 40;
D = group C by $0;
E = foreach D generate group, SUM(C.estimated_revenue);
store E into 'L3out';
Annotations: register UDFs and custom input formats; the second input is Ctrl-A delimited; the two datasets are joined using 40 reducers; the group follows the join; columns are referred to by position
Example 3 – Replicated Join
register pigperf.jar;
A = load 'page_views' using
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent, query_term, timestamp, estimated_revenue);
Big = foreach A generate user, (double) estimated_revenue;
alpha = load 'users' using PigStorage('\u0001') as (name, phone, address, city, state,
zip);
small = foreach alpha generate name, city;
C = join Big by user, small by name using 'replicated';
store C into 'out';
Annotations: a replicated join is an optimization for joining a big dataset with a small one; the small dataset is listed second
Example 5: Multiple Outputs
Annotations: SPLIT divides records into sets; DUMP displays the data; the multiple outputs are stored separately
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
DUMP X;
(1,2,3)
(4,5,6)
DUMP Y;
(4,5,6)
STORE X INTO 'x_out';
STORE Y INTO 'y_out';
STORE Z INTO 'z_out';
Parallel Independent Jobs
D1 = load 'data1' …
D2 = load 'data2' …
D3 = load 'data3' …
C1 = join D1 by a, D2 by b
C2 = join D1 by c, D3 by d
C1 and C2 are two independent
jobs that can run in parallel
Pig Compilation
Logic Plan
A=LOAD 'file1' AS (x, y, z);
B=LOAD 'file2' AS (t, u, v);
C=FILTER A by y > 0;
D=JOIN C BY x, B BY u;
E=GROUP D BY z;
F=FOREACH E GENERATE
group, COUNT(D);
STORE F INTO 'output';
[Figure: logical plan DAG - Load, Load, Filter, Join, Group, Foreach, Store]
Physical Plan
• 1:1 correspondence with the logical plan
• Except for - Join, Distinct, (Co)Group, Order
• Several optimizations are automatic
Pig Handling
• Schema and type checking
• Translating into efficient physical dataflow
– sequence of one or more MapReduce jobs
• Exploiting data reduction opportunities
– early partial aggregation via a combiner
• Executing the system-level dataflow
– running the MapReduce jobs
• Tracking progress, errors etc
Example Problem
• Given user data in one file,
and website data in
another, find the top 5 most
visited pages by users aged
18-25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
In Pig Latin
• Users = load 'users' as (name, age);
Filtered = filter Users by
age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group,
COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
[Figure: the dataflow - Load Users, Load Pages, Filter by age, Join on name, Group on url, Count clicks, Order by clicks, Take top 5 - is translated into three MapReduce jobs (Job 1, Job 2, Job 3)]
Translation to MapReduce
APACHE FLUME
Apache Flume
• A distributed, reliable and available service for efficiently
collecting, aggregating, and moving large amounts of log data
• One-stop solution for data collection of all formats
• A simple and flexible architecture based on streaming data flows
• A robust and fault tolerant architecture with tuneable reliability
mechanisms and many failover and recovery mechanisms
Apache Flume
• Uses a simple extensible data model that allows for online
analytic application
• Complex flows
– Flume allows a user to build multi-hop flows where events travel
through multiple agents before reaching the final destination
– It also allows fan-in and fan-out flows, contextual routing and backup
routes (fail-over) for failed hops
Apache Flume
Parallelism
• When running in MapReduce mode it’s important that the degree of
parallelism matches the size of the dataset
• By default, Pig uses one reducer per 1GB of input, up to a maximum of
999
• User can override these parameters by setting
pig.exec.reducers.bytes.per.reducer (the default is 1000000000 bytes)
and pig.exec.reducers.max (default 999)
Parallelism
• To explicitly set the number of reducers for each job, use a PARALLEL
clause for operators that run in the reduce phase
• These include all the grouping and joining operators GROUP, COGROUP,
JOIN, CROSS as well as DISTINCT and ORDER
• Following line sets the number of reducers to 30 for the GROUP
– grouped_records = GROUP records BY year PARALLEL 30;
• Alternatively, set the default_parallel option for all subsequent jobs
– grunt> set default_parallel 30
High Level Overview
[Figure: a Flume agent collects from sources such as local files, HDFS, stdin/stdout, Twitter, IRC and IMAP, and delivers the data into HDFS]
Data Flow Model
• A Flume event is defined as a unit of data flow having a byte payload and an
optional set of string attributes
• A Flume agent is a (JVM) process that hosts the components through which
events flow from an external source to the next destination (hop)
• A Flume source consumes events delivered to it by an external source like a web
server
• The external source sends events to Flume in a format that is recognized by the
target Flume source
Data Flow Model
• For example, an Avro Flume source can be used to receive Avro events
from Avro clients or other Flume agents in the flow that send events from
an Avro sink
• A similar flow can be defined using a Thrift Flume Source to receive
events from a Thrift Sink or a Flume Thrift RPC Client or Thrift clients
written in any language generated from the Flume thrift protocol
• When a Flume source receives an event, it stores it into one or more
channels
Data Flow Model
• The channel is a passive store that keeps the event until it’s consumed by a Flume
sink
• The file channel is one example – it is backed by the local file system
• The sink removes the event from the channel and puts it into an external
repository like HDFS (via Flume HDFS sink) or forwards it to the Flume source of
the next Flume agent (next hop) in the flow
• The source and sink within the given agent run asynchronously with the events
staged in the channel
HDFS Sink
• This sink writes events into the Hadoop Distributed File System (HDFS)
• Supports creating text and sequence files along with compression
• The files can be rolled (close current file and create a new one)
periodically based on the elapsed time or size of data or number of events
• Buckets/partitions data by attributes like timestamp or machine where
the event originated
HDFS Sink
• The HDFS directory path may contain formatting escape sequences that will
be replaced by the HDFS sink to generate a directory/file name to store
the events
• Hadoop installation is required so that Flume can use the Hadoop jars to
communicate with the HDFS cluster
• A version of Hadoop that supports the sync() call is required.
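An illustrative agent definition tying the pieces together - a source feeding a memory channel drained by the HDFS sink. The agent/component names, paths and the choice of an exec source are assumptions for the sketch, not part of the original deck:

# One agent (a1) with one source, one memory channel and one HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: exec source tailing an application log (illustrative)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Channel: fast but non-durable; a file channel would survive an agent crash
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events into HDFS, bucketed by date via escape sequences
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.rollInterval = 300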
Reliability & Recoverability
• The events are staged in a channel on each agent
• The events are then delivered to the next agent or terminal repository
(like HDFS) in the flow
• The events are removed from a channel only after they are stored in the
channel of next agent or in the terminal repository
• This is how the single-hop message delivery semantics in Flume provide
end-to-end reliability of the flow
Reliability & Recoverability
• Flume uses a transactional approach to guarantee the reliable delivery of the events
• The sources and sinks encapsulate the storage and retrieval, respectively, of the
events in a transaction provided by the channel
• This ensures that the set of events are reliably passed from point to point in the flow
• In the case of a multi-hop flow, the sink from the previous hop and the source from the next
hop both have their transactions running to ensure that the data is safely stored in the
channel of the next hop.
Reliability & Recoverability
• The events are staged in the channel, which manages recovery from
failure
• Flume supports a durable file channel which is backed by the local file
system
• There’s also a memory channel which simply stores the events in an in-
memory queue, which is faster but any events still left in the memory
channel when an agent process dies can’t be recovered
Multi-Agent flow
• For data to flow across multiple agents or hops, the sink of the previous
agent and source of the current hop need to be Avro type with the sink
pointing to the hostname (or IP address) and port of the source
Consolidation
• A very common scenario in log collection is a large number of log
producing clients sending data to a few consumer agents that are
attached to the storage subsystem
• For example, logs collected from hundreds of web servers can be sent to a dozen
agents that write to the HDFS cluster
• This can be achieved in Flume by configuring a number of first-tier agents
with an Avro sink, all pointing to an Avro source of a single agent (Thrift
sources / sinks / clients can be used in such a scenario)
• This source on the second tier agent consolidates the received events into
a single channel which is consumed by a sink to its final destination
Consolidation
Multiplexing Flow
• Flume supports multiplexing the event flow to one or more destinations
• This is achieved by defining a flow multiplexer that can replicate or selectively
route an event to one or more channels
• For the multiplexing case, an event is delivered to a subset of available channels
when an event’s attribute matches a preconfigured value
• For example, if an event attribute called “txnType” is set to “customer”, then it
should go to channel1 and channel3, if it’s “vendor” then it should go to channel2,
otherwise channel3
• The mapping can be set in the agent’s configuration file
Multiplexing Flow
APACHE SQOOP
Apache Sqoop
• An open-source tool to extract data from a relational database into HDFS or
HBase
• Available for MySQL, PostgreSQL, Oracle, SQL Server and DB2
• A single client program that creates one or more MapReduce jobs to perform
its tasks
• By default 4 map tasks are used in parallel
• Sqoop does not have any server processes
• If we assume a table with 1 million records and four mappers, then each will
process 250,000 records
Apache Sqoop
• With its knowledge of the primary key column, Sqoop can create four SQL statements to
retrieve the data, each using a different range of the primary key column as its predicate
• In the simplest case, this could be as straightforward as adding something like WHERE id
BETWEEN 1 AND 250000 to the first statement and using different id ranges for the others
• In addition to writing the contents of the database table to HDFS, Sqoop also provides a
generated Java source file (widgets.java) written to the current local directory
• Sqoop uses the generated code to handle the deserialization of table-specific data from the
database source before writing it to HDFS
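For illustration, the split column and degree of parallelism can be set explicitly on the command line (connection details follow the earlier examples in this deck; the split column and target directory are assumptions):

$ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest \
    --username hadoopuser -P \
    --table widgets --split-by id -m 4 \
    --target-dir /user/hadoopuser/widgets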
Apache Sqoop
Architecture
Commands
• codegen - generate code to interact with database records
• create-hive-table - import a table definition into Hive
• eval - evaluate a SQL statement and display the results
• export - export an HDFS directory to a database table
• help - list available commands
• import - import a table from a database to HDFS
• import-all-tables - import tables from a database to HDFS
• job - work with saved jobs
• list-databases - list available databases on a server
• list-tables - list available tables in a database
• merge - merge results of incremental imports
• metastore - run a standalone Sqoop metastore
• version - display version information
Importing data into Hive using Sqoop
• Sqoop has significant integration with Hive, allowing it to import data
from a relational source into either new or existing Hive tables
$ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest
--username hadoopuser -P
--table employees --hive-import --hive-table employees
Export
An export uses HDFS as the source of data and a remote database as the destination
% sqoop export --connect jdbc:mysql://localhost/hadoopguide -m 1 \
> --table sales_by_zip --export-dir /user/hive/warehouse/zip_profits \
> --input-fields-terminated-by '\0001'
...
10/07/02 16:16:50 INFO mapreduce.ExportJobBase: Transferred 41 bytes in 10.8947
seconds (3.7633 bytes/sec)
10/07/02 16:16:50 INFO mapreduce.ExportJobBase: Exported 3 records.
Export
APACHE ZOOKEEPER
Apache Zookeeper
• A set of tools to build distributed applications that can safely handle partial failures
• A rich set of building blocks to build a large class of coordination data structures and protocols like
distributed queues, distributed locks, and leader election among a group of peers
• Runs on a collection of machines for high availability
• Avoids single points of failure for reliability
• Facilitates loosely coupled interactions, so that participants do not need to know about one
another
• An open source, shared repository of implementations and recipes of common coordination
patterns
• Built-in services like naming, configuration management, locks and synchronization, and group
services provide high-performance coordination for distributed applications
Apache Zookeeper
• Written in Java
• Strongly consistent
• Ensemble of Servers
• In-memory data
• Datasets must fit in memory
• Shared hierarchical namespace
• Access Control list for each node
• Similar to a file system
Apache Zookeeper
Zookeeper Service
• All servers store the copy of data in memory
• A leader is elected at start up
• Followers respond to clients
• All updates go through leaders
• Responses are sent when a majority of servers have persisted changes
Zookeeper Service
High Availability
Znodes
• A unified concept of a node called a znode
• Acts both as a container of data (like a file) and a container of other
znodes (like a directory)
• Form a hierarchical namespace
• Two types - ephemeral or persistent. Set at creation time and not
changed later
• To build a membership list, create a parent znode with the name of the
group and child znodes with the name of the group members (servers)
Znodes
• Referenced by paths, which are represented as slash-delimited Unicode
character strings, like file system paths in Unix
• Paths must be absolute, so they must begin with a slash character
• Paths are canonical hence each path has a single representation
API
• create - Creates a znode
• delete - Deletes a znode (must not have any children)
• exists - Tests if a znode exists and retrieves its metadata
• getACL, setACL - Gets/sets the ACL for a znode
• getChildren - Gets a list of the children of a znode
• getData, setData - Gets/sets data associated with a znode
• sync - Synchronizes a client’s view of a znode with ZooKeeper
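A minimal sketch of the API in use, building the group-membership pattern described above (ensemble addresses, paths and data are illustrative; a production client would also wait for the connection event before issuing calls):

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class GroupMember {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble (host list, paths and data are illustrative)
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000,
                event -> { /* connection and watch events arrive here */ });

        // Persistent parent znode naming the group (created once)
        if (zk.exists("/servers", false) == null) {
            zk.create("/servers", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Ephemeral child: removed automatically when this client's session ends
        zk.create("/servers/host1", "10.0.0.1:8080".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Any client can list the live members of the group
        List<String> members = zk.getChildren("/servers", false);
        System.out.println("group members: " + members);
    }
}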
Ephemeral Nodes
• Deleted when the creating client’s session ends
• May not have children, not even ephemeral ones.
• Even though tied to a client session, they are visible to all
clients subject to their ACL policy
• Ideal for building applications that need to know when
certain distributed resources are available
Ephemeral Nodes
• Example - a group membership service that allows any
process to discover the members of the group at any
particular time
• A persistent znode is not tied to the client’s session and is
deleted only when explicitly deleted by any client
Sequence Nodes
• A sequential znode has a sequence number
• A znode created with sequential flag set has the value of a monotonically
increasing 10 digit counter, maintained by the parent znode, appended to
its name
• If a client asks to create a sequential znode with the name /a/b-, for
example, then the znode created may have a name like /a/b-3
Sequence Nodes
• Another new sequential znode created with the name /a/b- will have a unique
name with a larger value of the counter - for example, /a/b-5
• Sequence numbers can be used to impose a global ordering on events in a
distributed system, and may be used by the client to infer the ordering
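A small sketch of how a sequential znode is created (it assumes an open ZooKeeper handle like the one in the earlier sketch and an existing /election parent znode; names are illustrative):

// Each candidate creates an ephemeral, sequential znode under /election
static String enterElection(ZooKeeper zk) throws Exception {
    // EPHEMERAL_SEQUENTIAL appends the parent's counter to the name and ties the
    // znode to this session, so a crashed candidate disappears automatically
    String path = zk.create("/election/candidate-", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    return path;   // e.g. /election/candidate-0000000007; the lowest number acts as leader
}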
Watches
• Allow clients to get notifications when a znode changes (data or children)
• Works as a one-shot callback mechanism, triggered when a connection or znode
state changes
• A watch set on an exists operation will be triggered when the znode being
watched is created, deleted, or has its data updated
• A watch set on a getData operation will be triggered when the znode
being watched is deleted or has its data updated.
Watches
• A watch set on a getChildren operation will be triggered when a child of
the znode being watched is created or deleted, or when the znode itself is
deleted
• Triggered only once
• To receive multiple notifications, a client needs to reregister the watch
• If a client wishes to receive further notifications for the znode’s existence
(to be notified when it is deleted, for example), it needs to call the exists
operation again to set a new watch
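A sketch of this one-shot behaviour - the watcher re-registers itself every time it fires (the znode path and class name are illustrative):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ConfigWatcher implements Watcher {
    private final ZooKeeper zk;

    public ConfigWatcher(ZooKeeper zk) { this.zk = zk; }

    public void watchConfig() throws KeeperException, InterruptedException {
        // Passing a Watcher to getData registers a one-shot notification
        byte[] data = zk.getData("/config", this, null);
        System.out.println("config: " + new String(data));
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                watchConfig();   // re-register, since a watch fires only once
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}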
High Availability Mechanism
• For resilience, ZooKeeper runs in replicated mode on a
cluster of machines called an ensemble
• Achieves high-availability through replication, and can
provide a service as long as a majority of the machines in the
ensemble are up
• ZooKeeper uses a protocol called Zab that runs in two phases
and is repeated indefinitely
High Availability Mechanism
• Phase 1: Leader election
– The machines in an ensemble go through a process of electing a
distinguished member, called the leader
– The other machines are termed followers.
– This phase is finished once a majority (or quorum) of followers have
synchronized their state with the leader
High Availability Mechanism
• Phase 2: Atomic broadcast
– All write requests are forwarded to the leader, which broadcasts the update
to the followers
– When a majority have persisted the change, the leader commits the update,
and the client gets a response saying the update succeeded
– The protocol for achieving consensus is designed to be atomic, so a change
either succeeds or fails. It resembles a two-phase commit
Thank You
Check Out My LinkedIn Profile at
https://in.linkedin.com/in/girishkhanzode
Hadoop EcosystemHadoop Ecosystem
Hadoop EcosystemLior Sidi
 
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukCloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukAndrii Vozniuk
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemVaibhav Jain
 
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptxLecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptxNIKHILGR3
 

Similar to Hadoop (20)

HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Distributed Data processing in a Cloud
Distributed Data processing in a CloudDistributed Data processing in a Cloud
Distributed Data processing in a Cloud
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Lecture17.ppt
Lecture17.pptLecture17.ppt
Lecture17.ppt
 
Lecture17 (1).ppt
Lecture17 (1).pptLecture17 (1).ppt
Lecture17 (1).ppt
 
Lecture17.ppt
Lecture17.pptLecture17.ppt
Lecture17.ppt
 
big data ppt.ppt
big data ppt.pptbig data ppt.ppt
big data ppt.ppt
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukCloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptxLecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
 

More from Girish Khanzode (9)

Data Visulalization
Data VisulalizationData Visulalization
Data Visulalization
 
IR
IRIR
IR
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
NLP
NLPNLP
NLP
 
NLTK
NLTKNLTK
NLTK
 
NoSql
NoSqlNoSql
NoSql
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Language R
Language RLanguage R
Language R
 
Python Scipy Numpy
Python Scipy NumpyPython Scipy Numpy
Python Scipy Numpy
 

Recently uploaded

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Recently uploaded (20)

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Hadoop

  • 12. HDFS - Key Points • Files are broken into large blocks – Typically 128 MB block size – Blocks are replicated on multiple DataNodes for reliability • Understands rack locality – One replica on the local node, another on a remote rack, a third on the local rack; additional replicas are randomly placed • Data placement is exposed so that computation can be migrated to the data • Client talks to both the NameNode and DataNodes • Data is not sent through the NameNode; clients read data directly from DataNodes • Throughput of the file system scales nearly linearly with the number of nodes.
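A minimal sketch of how a client reads a file through this design, using the standard Hadoop FileSystem Java API (the path /data/sample.txt is a hypothetical example): the client asks the NameNode only for metadata and block locations, then streams the bytes directly from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);              // handle to the cluster file system
    try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);  // bytes are read directly from DataNodes
    }
  }
}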
  • 13. NameNode • DFS Master – Manages the file system namespace – Controls read/write access to files – Manages block replication – Checkpoints namespace and journals namespace changes for reliability • Metadata of Name node in Memory – The entire metadata is in main memory – No demand paging of FS metadata • Types of Metadata: – List of files, file and chunk namespaces; list of blocks, location of replicas; file attributes etc.
  • 14. DataNodes • Serve read/write requests from clients • Perform replication tasks upon instruction by NameNode • Stores data in the local file system • Stores metadata of a block (e.g. CRC) • Serves data and metadata to Clients • Periodically sends a report of all existing blocks to the NameNode • Periodically sends heartbeat to NameNode (detect node failures) • Facilitates Pipelining of Data (to other specified DataNodes)
  • 16. HDFS High Availability • Option of running two redundant NameNodes in the same cluster • Active/Passive configuration with a hot standby • Fast fail-over to a new NameNode if a machine crashes • Graceful administrator-initiated fail-over for planned maintenance
  • 18. NameNode Failure • Prior to Hadoop 2.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster • Secondary Name Node – Not a standby for Name Node – Connects to Name Node every hour – Performs housekeeping, backup of Name Node metadata – Saved metadata can rebuild a failed Name Node
  • 19. DataNode Failure • Each DataNode periodically sends a Heartbeat message to the NameNode • If the NameNode does not receive a heartbeat from a particular DataNode for 10 minutes, then it considers that data node to be dead/out of service. • NameNode initiates replication of blocks hosted on that data node to some other data node
  • 21. MapReduce Framework • Programming model developed at Google • Sort/merge based distributed computing • Automatic parallel execution & distribution • Fault tolerant • Functional style programming for parallelism across a large cluster of nodes • Works like a parallel Unix pipeline: – cat input | grep | sort | uniq -c | cat > output – Input | Map | Shuffle & Sort | Reduce | Output
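Hadoop Streaming makes the pipeline analogy above literal: any executables that read stdin and write stdout can act as the mapper and reducer. A hedged sketch (the streaming jar location varies by distribution, and the input/output paths are illustrative):

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input /data/logs \
  -output /data/logs-counts \
  -mapper /bin/cat \
  -reducer /usr/bin/wc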
  • 22. MapReduce Framework • Underlying system takes care of – partitioning of the input data – scheduling the program’s execution across several machines – handling machine failures – managing inter-machine communication • Provides inter-node communication – Failure recovery, consistency etc – Load balancing, scalability etc • Suitable for batch processing applications – Log processing – Web index building
  • 24. What is MapReduce Used For? • At Google: – Index building for Google Search – Article clustering for Google News – Statistical machine translation • At Yahoo!: – Index building for Yahoo! Search – Spam detection for Yahoo! Mail • At Facebook: – Data mining – Ad optimization – Spam detection
  • 25. MapReduce Components • JobTracker – Map/Reduce Master – Accepts MR jobs submitted by users – Assigns Map and Reduce tasks to TaskTrackers – Monitors task and TaskTracker statuses, Re-executes tasks upon failure • TaskTrackers – Map/Reduce Slaves – Run Map and Reduce tasks upon instruction from the JobTracker – Manage storage and transmission of intermediate output
  • 27. Distributed Execution – diagram: the user program forks a master and several workers; the master assigns map tasks and reduce tasks; map workers read input splits and write intermediate data to local disk; reduce workers remote-read and sort that data and write the output files
  • 28. Working of MapReduce • The run time partitions the input and provides it to different Map instances • Map (k1, v1) -> (k2, v2) • The run time collects the (k2, v2) pairs and distributes them to several Reduce functions so that each Reduce function gets the pairs with the same k2 • Each Reduce produces a single (or zero) file output • Map and Reduce are user written functions
  • 29. Input and Output • MapReduce operates exclusively on <key, value> pairs • Job Input: <key, value> pairs • Job Output: <key, value> pairs • Key and value can be different types, but must be serializable by the framework – map: <k1, v1> → <k2, v2>; reduce: <k2, v2> → <k3, v3>
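In Hadoop this "serializable by the framework" requirement means the key and value types implement the Writable interface (keys additionally implement WritableComparable). A minimal sketch of a custom value type; the class name CountSumWritable and its two fields are hypothetical:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Example value type carrying a count and a running sum between map and reduce
public class CountSumWritable implements Writable {
  private long count;
  private double sum;

  public void set(long count, double sum) { this.count = count; this.sum = sum; }

  @Override
  public void write(DataOutput out) throws IOException {    // how the framework serializes the value
    out.writeLong(count);
    out.writeDouble(sum);
  }

  @Override
  public void readFields(DataInput in) throws IOException { // how the framework deserializes it
    count = in.readLong();
    sum = in.readDouble();
  }

  @Override
  public String toString() { return count + "\t" + sum; }   // used by the default text output format
}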
  • 30. Example - Counting Words • Given a large collection of documents, output the frequency for each unique word • After putting this data into HDFS, Hadoop automatically splits into blocks and replicates each block
  • 31. Input Reader • The input reader reads a block and divides it into splits • Each split is sent to a map function – a line is one input of a map function • The key could be some internal number (filename-blockid-lineid); the value is the content of the textual line – diagram: Block 1 contains the lines "Apple Orange Mongo" and "Orange Grapes Plum", Block 2 contains "Apple Plum Mongo" and "Apple Apple Plum"; the input reader hands each line to a mapper
  • 32. Mapper - Map Function • Mapper takes the output generated by the input reader and outputs a list of intermediate <key, value> pairs – diagram: mappers m1–m4 turn the four input lines into pairs such as (Apple, 1), (Orange, 1), (Mongo, 1), (Grapes, 1) and (Plum, 1), one pair per word occurrence
  • 33. Reducer - Reduce Function • Reducer takes the output generated by the Mapper, aggregates the value for each key, and outputs the final result • There is shuffle/sort before reducing.
  • 34. Reducer - Reduce Function – diagram: shuffle/sort groups the intermediate pairs by key, and reducers r1–r5 emit the totals (Apple, 4), (Orange, 2), (Grapes, 1), (Mongo, 2) and (Plum, 3)
  • 35. Reducer - Reduce Function • The same key MUST go to the same reducer • Different keys CAN go to the same reducer – diagram: every (Orange, 1) pair goes to reducer r2, which emits (Orange, 2); r2 may also receive the (Grapes, 1) pair and then emits both (Orange, 2) and (Grapes, 1)
  • 36. Combiner • When the map operation outputs its pairs they are already available in memory • For efficiency, it sometimes makes sense to take advantage of this by supplying an (optional) combiner class to perform a reduce-type function on the map side • If a combiner is used, the map key-value pairs are not immediately written to the output • Instead they are collected in lists, one list per key – diagram: the map output (Apple, 1), (Apple, 1), (Plum, 1) is combined into (Apple, 2), (Plum, 1) before the shuffle
  • 37. Partitioner - Partition function • When a mapper emits a key-value pair, it has to be sent to one of the reducers - which one? • The mechanism sending specific key-value pairs to specific reducers is called partitioning (the key-value pair space is partitioned among the reducers) • In Hadoop, the default partitioner is HashPartitioner, which hashes a record’s key to determine which partition (and thus which reducer) the record belongs in • The number of partitions is equal to the number of reduce tasks for the job
  • 38. Importance of Partition • It has a direct impact on the overall performance of the job • A poorly designed partitioning function will not evenly distribute the load over the reducers, potentially losing much of the benefit of the map/reduce distributed infrastructure • It may sometimes be necessary to control how key/value pairs are partitioned over the reducers
  • 39. Importance of Partition • Example: a job’s input is a huge set of tokens with their numbers of occurrences, and you want to sort them by number of occurrences – diagram: contrasts the reducer load without any customized partitioner against the load with a customized partitioner (see the sketch below)
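A minimal sketch of a customized partitioner under the Hadoop Java API; the class name and the routing rule (spreading word keys over reducers by first letter so each reducer gets a contiguous alphabetical range) are made up for illustration:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    char first = Character.toLowerCase(key.toString().charAt(0));
    if (first < 'a' || first > 'z') {
      return 0;                                   // non-alphabetic keys all go to the first reducer
    }
    return (first - 'a') * numPartitions / 26;    // map a..z onto partitions 0..numPartitions-1
  }
}

It would be wired into a job with job.setPartitionerClass(FirstLetterPartitioner.class); without it, Hadoop falls back to HashPartitioner, i.e. hash(key) mod R.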
  • 40. Example - Word Count • map(String key, String value): // key: document name; value: document contents; map (k1,v1) -> list(k2,v2) for each word w in value: EmitIntermediate(w, "1"); (If input string is (“abc def ghi abc mno pqr”), Map produces {<“abc”,1”>, <“def”, 1>, <“ghi”, 1>, <“abc”,1>, <“mno”,1>,<“pqr”,1>} • reduce(String key, Iterator values): // key: a word; values: a list of counts; reduce (k2,list(v2)) -> list(v2) int result = 0; for each v in values: result += ParseInt(v); Emit(AsString(result)); (Example: reduce(“abc”, <1,1>) -> 2)
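For reference, a complete runnable version of this word count under the Hadoop Java MapReduce API; this is essentially the canonical example shipped with Hadoop. Note that the reducer class is also registered as the combiner, tying back to slide 36:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // reduce(k2, list(v2)) -> (k2, sum): add up the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional map-side pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}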
  • 42. Simplified MapReduce – diagram: on two machines, a local map turns the input pairs <k1, v1> … <k6, v6> into new-key pairs <nk1, nv1> … <nk1, nv6>; a local sort groups them by new key; a local reduce then emits the per-key counts <nk2, 3>, <nk1, 2>, <nk3, 1>
  • 43. JobTracker Failure • If the master task dies, a new copy can be started from the last check-pointed state. However, in most cases, the user restarts the job • After restarting JobTracker all the jobs running at the time of the failure should be resubmitted
  • 44. Handling TaskTracker Failure • The JobTracker pings every worker periodically • If no response is received from a worker in a certain amount of time, the master marks the worker as failed • Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. • Any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. • Task tracker will stop sending the heartbeat to the Job Tracker
  • 45. Handling TaskTracker Failure • JobTracker notices this failure • It hasn’t received a heartbeat for 10 minutes • Can be configured via the mapred.tasktracker.expiry.interval property • JobTracker removes this TaskTracker from the task pool • Map tasks are rerun even if they ran to completion • Their intermediate output resides on the failed TaskTracker’s local file system, which is not accessible to the reduce tasks.
  • 47. Data flow • Input, final output are stored on a distributed file system – Scheduler tries to schedule map tasks “close” to physical storage location of input data • Intermediate results are stored on local FS of map and reduce workers • Output is often input to another map reduce task
  • 48. Coordination • Master data structures – Task status: (idle, in-progress, completed) – Idle tasks get scheduled as workers become available – When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer – Master pushes this info to reducers • Master pings workers periodically to detect failures
  • 49. Failures • Map worker failure – Map tasks completed or in-progress at worker are reset to idle – Reduce workers are notified when task is rescheduled on another worker • Reduce worker failure – Only in-progress tasks are reset to idle • Master failure – MapReduce task is aborted and client is notified
  • 50. How many Map and Reduce jobs? • M - map tasks, R - reduce tasks • Rule of thumb – Make M and R much larger than the number of nodes in cluster – One DFS chunk per map is common – Improves dynamic load balancing and speeds recovery from worker failure • Usually R is smaller than M because output is spread across R files
  • 51. Mapping Workers to Processors • MapReduce master takes the location information of the input files and schedules a map task on a machine that contains a replica of the corresponding input data • If failed, it attempts to schedule a map task near a replica of that task's input data • When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth
  • 52. Combiner Function • User can specify a Combiner function that does partial merging of the intermediate local disk data before it is sent over the network. • The Combiner function is executed on each machine that performs a map task • Typically the same code is used to implement both the combiner and the reduce functions • Often a map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k – popular words in Word Count • Can save network time by pre-aggregating at mapper – combine(k1, list(v1)) -> v2 – Usually same as reduce function • Works only if reduce function is commutative and associative
  • 53. Partition Function • The users of MapReduce specify the number of reduce tasks/output files that they desire (R) • Data gets partitioned across these tasks using a partitioning function on the intermediate key • A default partitioning function is provided that uses hashing (hash(key) mod R) • In some cases, it may be useful to partition data by some other function of the key. The user of the MapReduce library can provide a special partitioning function.
  • 54. Task Granularity • The map phase has M pieces and the reduce phase has R pieces • M and R should be much larger than the number of worker machines • Having each worker perform many different tasks improves dynamic load balancing and also speeds up recovery when a worker fails • Larger the M and R, more the decisions the master must make • R is often constrained by users because the output of each reduce task ends up in a separate output file • Typically - at Google, M = 200,000 and R = 5,000, using 2,000 worker machines
  • 57. Execution Summary • Distributed Processing – Partition input key/value pairs into chunks, run map() tasks in parallel – After all map()s are complete, consolidate all emitted values for each unique emitted key – Now partition space of output map keys, and run reduce() in parallel • If map() or reduce() fails -> re-execute
  • 58. MapReduce – Data Flow • Input reader – divides input into appropriate size splits which get assigned to a Map function. • Map function – maps file data/split to smaller, intermediate <key, value> pairs. • Partition function – finds the correct reducer: given the key and number of reducers, returns the desired reducer node. (optional) • Compare function – input from the Map intermediate output is sorted according to the compare function. (optional) • Reduce function – takes intermediate values and reduces to a smaller solution handed back to the framework. • Output writer – writes file output.
  • 59. Execution Overview • The MapReduce library in user program splits input files into M pieces of typically 16 MB to 64 MB/piece • It then starts up many copies of the program on a cluster of machines • One of the copies of the program is the master • The rest are workers that are assigned work by the master • There are M map tasks and R reduce tasks to assign • The master picks idle workers and assigns each one a map task or a reduce task • A worker who is assigned a map task reads the contents of the assigned input split • It parses key/value pairs out of the input data and passes each pair to the user-defined Map function • The intermediate key/value pairs produced by the Map function are buffered in memory • The locations of these buffered pairs on the local disk are passed back to the master, who forwards these locations to the reduce workers
  • 60. Execution Overview • When a reduce worker is notified by the master about these locations, it uses RPC remote procedure calls to read the buffered data from the local disks of the map workers • When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together • The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function • The output of the Reduce function is appended to a final output file for this reduce partition • When all map tasks and reduce tasks have been completed, the master wakes up the user program - the MapReduce call in the user program returns back to the user code • The output of the mapreduce execution is available in the R output files (one per reduce task)
  • 61. MapReduce Advantages • Distribution is completely transparent – Not a single line of distributed programming (ease, correctness) • Automatic fault-tolerance – Determinism enables running failed tasks somewhere else again – Saved intermediate data enables just re-running failed reducers • Automatic scaling – As operations are side-effect free, they can be distributed to any number of machines dynamically • Automatic load-balancing – Move tasks and speculatively execute duplicate copies of slow tasks (stragglers)
  • 63. Need for High-Level Languages • Hadoop is great for large-data processing – But writing Java programs for everything is verbose and slow – Not everyone wants to (or can) write Java code • Solution: develop higher-level data processing languages – Hive - HQL is like SQL – Pig - Pig Latin is a bit like Perl • Hive - data warehousing application in Hadoop – Query language is HQL, variant of SQL – Tables stored on HDFS as flat files
  • 64. Need for High-Level Languages • Pig - large-scale data processing system – Scripts are written in Pig Latin, a dataflow language – Developed by Yahoo, now open source • Common idea – Provide higher-level language to facilitate large-data processing – Higher-level language compiles down to Hadoop jobs
  • 65. Hive - Background • Started at Facebook • Data was collected by nightly cron jobs into Oracle DB • ETL via hand-coded python • Grew from 10s of GBs (2006) to 1 TB/day new data (2007), now 10x that • A data warehouse system to facilitate easy data summarization, ad-hoc queries and the analysis of large datasets stored in Hadoop compatible file systems • Supports Hive Query Language (HQL) statements similar to SQL statements Source: cc-licensed slide by Cloudera
  • 66. Hive • HiveQL is a subset of SQL covering the most common statements • HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster • JDBC/ODBC support • Follows schema-on-read design – very fast initial load • Agile data types: Array, Map, Struct, and JSON objects • User Defined Functions and Aggregates • Regular Expression support • Partitions and Buckets (for performance optimization)
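A minimal sketch of the JDBC route into Hive, assuming a HiveServer2 instance listening on localhost:10000 and using the page_view table that appears in the Hive QL slides below; the driver class and URL scheme are the standard HiveServer2 ones:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // HiveServer2 compiles each HQL statement into MapReduce jobs that run on the cluster
    try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT pageid, COUNT(DISTINCT userid) FROM page_view GROUP BY pageid")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}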
  • 67. Hive Components • Shell: allows interactive queries • Driver: session handles, fetch, execute • Compiler: parse, plan, optimize • Execution engine: DAG of stages (MR, HDFS, metadata) • Metastore: schema, location in HDFS, SerDe Source: cc-licensed slide by Cloudera
  • 70. Data Model • Basic column types (int, float, boolean) • Complex types: List / Map ( associate array), Struct CREATE TABLE complex ( col1 ARRAY<INT>, col2 MAP<STRING, INT>, col3 STRUCT<a:STRING, b:INT, c:DOUBLE> ); • Built-in functions – mathematical, statistical, string, date, conditional functions, aggregate functions and functions for working with XML and JSON
  • 71. Data Model • Tables – Typed columns (int, float, string, boolean) – list: map (for JSON-like data) • Partitions – For example, range-partition tables by date • Buckets – Hash partitions within ranges – useful for sampling, join optimization Source: cc-licensed slide by Cloudera
  • 72. Metastore • Database: namespace containing a set of tables • Holds table definitions (column types, physical layout) • Holds partitioning information • Can be stored in Derby, MySQL, and many other relational databases Source: cc-licensed slide by Cloudera
  • 73. Physical Layout • Warehouse directory in HDFS – /user/hive/warehouse • Tables stored in subdirectories of warehouse – Partitions form subdirectories of tables • Actual data stored in flat files – Control char-delimited text or SequenceFiles – With custom SerDe, can use arbitrary format Source: cc-licensed slide by Cloudera
  • 74. Metadata • Data organized into tables • Metadata like table schemas stored in the database metastore • The metastore is the central repository of Hive metadata • Metastore runs in the same process as the Hive service • Loading data into a Hive table results in copying the data file into its working directory and input data is not processed into rows • HiveQL queries use metadata for query execution
  • 75. Tables • Logically made up of the data being stored and the associated metadata describing the layout of the data in the table • The data can reside in an HDFS-like file system or S3 • Hive stores the metadata in a relational database and not in HDFS • When a table is created, Hive moves the data into its warehouse directory • External table – Hive refers to data outside the warehouse directory
  • 76. Partitioning • Hive organizes tables into partitions by dividing a table into coarse-grained parts based on the value of a partition column, such as date • Using partitions makes queries faster on slices of the data • Log files with each record containing a timestamp - If partitioned by date, records for the same date would be stored in the same partition • Queries restricted to a particular date or set of dates are more efficient since only required files are scanned • Partitioning on multiple dimensions allowed. • Defined at table creation time • Separate subdirectory for each partition
  • 77. Bucketing • Partitions further organized in buckets for more efficient queries • Clustered by clause is used to create buckets using the specified column • Data within a bucket can be additionally sorted by one or more columns
  • 78. UDF • A UDF operates on a single row and produces a single row as its output; most functions, such as mathematical and string functions, are of this type • A UDAF (user-defined aggregate function) works on multiple input rows and creates a single output row, e.g. COUNT and MAX • A UDTF (user-defined table-generating function) operates on a single row and produces multiple rows (a table) as output
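A minimal sketch of a simple Hive UDF in Java, written against the classic org.apache.hadoop.hive.ql.exec.UDF base class; the class name Lower is a hypothetical example. After packaging it in a jar it would be registered with ADD JAR and CREATE TEMPORARY FUNCTION:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Lower-cases a string column; Hive finds the evaluate() method by reflection
public class Lower extends UDF {
  public Text evaluate(Text input) {
    if (input == null) {
      return null;
    }
    return new Text(input.toString().toLowerCase());
  }
}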
  • 79. INSERT INTO TABLE pv_users SELECT pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid); X = page_view user pv_users Hive QL – Join
  • 81. INSERT INTO TABLE pageid_age_sum SELECT pageid, age, count(1) FROM pv_users GROUP BY pageid, age; pv_users pageid_age_sum Hive QL – Group By
  • 83. SELECT pageid, COUNT(DISTINCT userid) FROM page_view GROUP BY pageid page_view result Hive QL – Group By with Distinct
  • 84. page_view Shuffle and Sort Reduce Shuffle key is a prefix of the sort key. Hive QL – Group By with Distinct in Map Reduce
  • 86. Hive Benefits • An easy way to process large-scale data • Supports SQL-based queries • Provides user-defined interfaces to extend programmability • Efficient execution plans for performance • Interoperability with other database tools
  • 88. Apache Pig • Framework to analyze large un-structured and semi-structured data on top of Hadoop • Consists of a high-level language for expressing data analysis programs, coupled with infrastructure • Compiles down to MapReduce jobs • Infrastructure layer consists of – a compiler to create sequences of Map-Reduce programs – language layer consists of a textual language called Pig Latin
  • 90. Pig Latin • A scripting language to explore large datasets • Easy to achieve parallel execution of simple data analysis tasks • Complex tasks with multiple interrelated data transformations explicitly encoded • Automatic optimization • Create own functions for special-purpose processing • A script can map to multiple map-reduce jobs
  • 92. Benefits • Faster development – Fewer lines of code (Writing map reduce is like writing SQL queries) – Re-use the code (Pig library, Piggy bank) • Conduct a test: Find the top 5 words with most high frequency – Pig Latin needed 10 lines of code as against 200 lines in Java – Pig execution time was 15 minutes as against 4 hours in Java
  • 94. Language Features • A Pig Latin program is made up of a series of transformations applied to the input data to produce output • A declarative, SQL-like language, the high level language interface for Hadoop • Pig Engine Parses, compiles Pig Latin scripts into MapReduce jobs run on top of Hadoop • Keywords - Load, Filter, For each Generate, Group By, Store, Join, Distinct, Order By • Aggregations - Count, Avg, Sum, Max, Min • Schema - Defined at query-time not when files are loaded • UDFs • Packages for common input/output formats
  • 95. Language Features • Virtually all parts of the processing path are customizable: loading, storing, filtering, grouping, and joining can all be altered by UDFs • Writing load and store functions is easy once an InputFormat and OutputFormat exist • Multi-query: pig combines certain types of operations together in a single pipeline to reduce the number of times data is scanned. • Order by provides total ordering across reducers in a balanced way • Piggybank is a repository of UDF Pig functions shared by the Pig community
  • 96. Data Types • Scalar Types - Int, long, float, double, Boolean, null, chararray, bytearray • Complex Types – Field - a piece of data – Tuple - an ordered set of fields – Bag - a collection of tuples – Relation - a bag
  • 97. Data Types • Samples – Tuple is a row in database - ( 0002576169, Tome, 20, 4.0) • Bag – a table or a view in database – an unordered collection of tuples represented using curly braces • {(0002576169 , Tome, 20, 4.0), • (0002576170, Mike, 20, 3.6), • (0002576171 Lucy, 19, 4.0), …. }
  • 98. Running a Pig Latin Script • Local mode – Local host and local file system is used – Neither Hadoop nor HDFS is required – Useful for prototyping and debugging – Suitable only for small datasets • MapReduce mode – Run on a Hadoop cluster and HDFS
  • 99. Running a Pig Latin Script • Batch mode - run a script directly – pig -x local my_pig_script.pig – pig -x mapreduce my_pig_script.pig • Interactive mode uses the Pig shell Grunt to run scripts – grunt> Lines = LOAD '/input/input.txt' AS (line: chararray); – grunt> Unique = DISTINCT Lines; – grunt> DUMP Unique;
  • 100. Running a Pig Latin Script
  • 101. Operations • Loading data – LOAD loads input data – Lines=LOAD ‘input/access.log’ AS (line: chararray); • Projection – FOREACH … GENERATE … (similar to SELECT) – takes a set of expressions and applies them to every record
  • 102. Operations • Grouping – collects together records with the same key • Dump/Store – DUMP displays results to screen - The trigger for Pig to start execution – STORE save results to file system • Aggregation – AVG, COUNT, MAX, MIN, SUM
  • 103. Foreach ... Generate • Iterates over the members of a bag • Example – student_data = FOREACH students GENERATE studentid, name • The result of statement is another bag • Elements are named as in the input bag
  • 104. Positional Reference • Fields are referred to by positional notation or by name (alias) – students = LOAD 'student.txt' USING PigStorage() AS (name:chararray, age:int, gpa:float); – DUMP students; – (John,18,4.0F) – (Mary,19,3.8F) – (Bill,20,3.9F) – studentname = FOREACH students GENERATE $0 AS studentname;
  • 106. Group • Groups data in one relation • GROUP and COGROUP operators are identical but COGROUP creates a nested set of output tuples • Both operators work with one or more relations • For readability GROUP is used in statements involving one relation • COGROUP is used in statements involving two or more relations
  • 107. Group grunt> DUMP A; (John, Pasta) (Kate, Burger) (Joe, Orange) (Eve, Apple) Let’s group by the number of characters in the second field: grunt> B = GROUP A BY SIZE($1); grunt> DUMP B; (5,{(John, Pasta),(Eve, Apple)}) (6,{(Kate, Burger),(Joe, Orange)})
  • 108. Dump & Store A = LOAD ‘input/pig/multiquery/A’; B = FILTER A by $1 == “apple”; C = FILTER A by $1 == “apple”; SOTRE B INTO “output/b” STORE C INTO “output/c” Relations B&C both derived from A Prior this would create two MapReduce jobs Pig will now create one MapReduce job with output results
  • 109. Count • Computes the number of elements in a bag. • Requires a preceding GROUP ALL statement for global counts and GROUP BY statement for group counts. • X = FOREACH B GENERATE COUNT(A);
  • 110. Pig Operation - Order • Sorts a relation based on one or more fields • In Pig, relations are unordered • If you order relation A to produce relation X, relations A and X still contain the same elements • student = ORDER students BY gpa DESC;
  • 111. Example 1 raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, id, time, query); clean1 = FILTER raw BY id > 20 AND id < 100; clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.sanitze(query) as query; user_groups = GROUP clean2 BY (user, query); user_query_counts = FOREACH user_groups GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time); STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(','); – annotations: read from HDFS; input format is tab delimited; run-time schema; row filtering on predicates; group records; group aggregation; store output in a comma-delimited text file
  • 112. Example 2 A = load '$widerow' using PigStorage('\u0001') as (name: chararray, c0: int, c1: int, c2: int); B = group A by name parallel 10; C = foreach B generate group, SUM(A.c0) as c0, SUM(A.c1) as c1, AVG(A.c2) as c2; D = filter C by c0 > 100 and c1 > 100 and c2 > 100; store D into '$out'; – annotations: $widerow and $out are script arguments; input is Ctrl-A delimited; column types are defined in the schema; PARALLEL 10 requests 10 reducers
  • 113. Example 3 – Repartition join register pigperf.jar; A = load 'page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, timestamp, estimated_revenue); B = foreach A generate user, (double) estimated_revenue; alpha = load 'users' using PigStorage('\u0001') as (name, phone, address, city, state, zip); beta = foreach alpha generate name, city; C = join beta by name, B by user parallel 40; D = group C by $0; E = foreach D generate group, SUM(C.estimated_revenue); store E into 'L3out'; – annotations: register UDFs and custom input formats; Ctrl-A delimited input; join two datasets using 40 reducers; load the second file; group after the join; refer to columns by position
  • 114. Example 3 – Replicated Join register pigperf.jar; A = load 'page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, timestamp, estimated_revenue); Big = foreach A generate user, (double) estimated_revenue; alpha = load 'users' using PigStorage('\u0001') as (name, phone, address, city, state, zip); small = foreach alpha generate name, city; C = join Big by user, small by name using 'replicated'; store C into 'out'; – annotations: replicated join, the small dataset is listed second; an optimization for joining a big dataset with a small one
  • 115. Example 5: Multiple Outputs Split records into sets Dump Command to display data Store multiple output A = LOAD 'data' AS (f1:int,f2:int,f3:int); DUMP A; (1,2,3) (4,5,6) (7,8,9) SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6); DUMP X; (1,2,3) (4,5,6) DUMP Y; (4,5,6) STORE x INTO 'x_out'; STORE y INTO 'y_out'; STORE z INTO 'z_out';
  • 116. Parallel Independent Jobs D1 = load 'data1' … D2 = load 'data2' … D3 = load 'data3' … C1 = join D1 by a, D2 by b C2 = join D1 by c, D3 by d C1 and C2 are two independent jobs that can run in parallel
  • 118. Logic Plan A=LOAD 'file1' AS (x, y, z); B=LOAD 'file2' AS (t, u, v); C=FILTER A by y > 0; D=JOIN C BY x, B BY u; E=GROUP D BY z; F=FOREACH E GENERATE group, COUNT(D); STORE F INTO 'output'; Load Load Filter Join Group For each Store
  • 119. Physical Plan • 1:1 correspondence with the logical plan • Except for - Join, Distinct, (Co)Group, Order • Several optimizations are automatic
  • 120. Pig Handling • Schema and type checking • Translating into efficient physical dataflow – sequence of one or more MapReduce jobs • Exploiting data reduction opportunities – early partial aggregation via a combiner • Executing the system-level dataflow – running the MapReduce jobs • Tracking progress, errors etc
  • 121. Example Problem • Given user data in one file, and website data in another, find the top 5 most visited pages by users aged 18-25 Load Users Load Pages Filter by age Join on name Group on url Count clicks Order by clicks Take top 5
  • 122. In Pig Latin • Users = load ‘users’ as (name, age); Filtered = filter Users by age >= 18 and age <= 25; Pages = load ‘pages’ as (user, url); Joined = join Filtered by name, Pages by user; Grouped = group Joined by url; Summed = foreach Grouped generate group, count(Joined) as clicks; Sorted = order Summed by clicks desc; Top5 = limit Sorted 5; store Top5 into ‘top5sites’;
  • 123. Users = load … Filtered = filter … Pages = load … Joined = join … Grouped = group … Summed = … count()… Sorted = order … Top5 = limit … Load Users Load Pages Filter by age Join on name Group on url Count clicks Order by clicks Take top 5 Job 1 Job 2 Job 3 Translation to MapReduce
  • 125. Apache Flume • A distributed, reliable and available service for efficiently collecting, aggregating, and moving large amounts of log data • One-stop solution for data collection of all formats • A simple and flexible architecture based on streaming data flows • A robust and fault tolerant architecture with tuneable reliability mechanisms and many failover and recovery mechanisms
  • 126. Apache Flume • Uses a simple extensible data model that allows for online analytic application • Complex flows – Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination – It also allows fan-in and fan-out flows, contextual routing and backup routes (fail-over) for failed hops
  • 128. Parallelism • When running in MapReduce mode it’s important that the degree of parallelism matches the size of the dataset • By default, Pig uses one reducer per 1GB of input, up to a maximum of 999 • User can override these parameters by setting pig.exec.reducers.bytes.per.reducer (the default is 1000000000 bytes) and pig.exec.reducers.max (default 999)
  • 129. Parallelism • To explicitly set the number of reducers for each job, use a PARALLEL clause for operators that run in the reduce phase • These include all the grouping and joining operators GROUP, COGROUP, JOIN, CROSS as well as DISTINCT and ORDER • Following line sets the number of reducers to 30 for the GROUP – grouped_records = GROUP records BY year PARALLEL 30; • Alternatively, set the default_parallel option for all subsequent jobs – grunt> set default_parallel 30
  • 130. High Level Overview • Local Files • HDFS • Stdin, Stdout • Twitter • IRC • IMAP HDFS Agent
  • 131. Data Flow Model • A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes • A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop) • A Flume source consumes events delivered to it by an external source like a web server • The external source sends events to Flume in a format that is recognized by the target Flume source
  • 132. Data Flow Model • For example, an Avro Flume source can be used to receive Avro events from Avro clients or other Flume agents in the flow that send events from an Avro sink • A similar flow can be defined using a Thrift Flume Source to receive events from a Thrift Sink or a Flume Thrift RPC Client or Thrift clients written in any language generated from the Flume thrift protocol • When a Flume source receives an event, it stores it into one or more channels
  • 133. Data Flow Model • The channel is a passive store that keeps the event until it’s consumed by a Flume sink • The file channel is one example – it is backed by the local file system • The sink removes the event from the channel and puts it into an external repository like HDFS (via Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow • The source and sink within the given agent run asynchronously with the events staged in the channel
  • 134. HDFS Sink • This sink writes events into the Hadoop Distributed File System (HDFS) • Supports creating text and sequence files along with compression • The files can be rolled (close current file and create a new one) periodically based on the elapsed time or size of data or number of events • Buckets/partitions data by attributes like timestamp or machine where the event originated
  • 135. HDFS Sink • The HDFS directory path may contain formatting escape sequences that will be replaced by the HDFS sink to generate a directory/file name to store the events • A Hadoop installation is required so that Flume can use the Hadoop jars to communicate with the HDFS cluster • A version of Hadoop that supports the sync() call is required.
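A minimal sketch of a single-agent Flume configuration ending in an HDFS sink, in Flume's standard properties format; the agent name a1, the spooled directory and the HDFS path are hypothetical examples:

# agent a1: spooling-directory source -> durable file channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/incoming
a1.sources.r1.channels = c1

a1.channels.c1.type = file

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0

The agent would then be started with something like: flume-ng agent --conf conf --conf-file hdfs-sink.properties --name a1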
  • 136. Reliability & Recoverability • The events are staged in a channel on each agent • The events are then delivered to the next agent or terminal repository (like HDFS) in the flow • The events are removed from a channel only after they are stored in the channel of next agent or in the terminal repository • This is a how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow
  • 137. Reliability & Recoverability • Flume uses a transactional approach to guarantee the reliable delivery of the events • The sources and sinks encapsulate in a transaction the storage/retrieval, respectively, of the events placed in or provided by a transaction provided by the channel • This ensures that the set of events are reliably passed from point to point in the flow • In the case of a multi-hop flow, the sink from the previous hop and the source from the next hop both have their transactions running to ensure that the data is safely stored in the channel of the next hop.
  • 138. Reliability & Recoverability • The events are staged in the channel, which manages recovery from failure • Flume supports a durable file channel which is backed by the local file system • There is also a memory channel which simply stores the events in an in-memory queue; this is faster, but any events still left in the memory channel when an agent process dies cannot be recovered
  • 139. Multi-Agent flow • For data to flow across multiple agents or hops, the sink of the previous agent and source of the current hop need to be Avro type with the sink pointing to the hostname (or IP address) and port of the source
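A sketch of the Avro sink/source pairing between two hops; the agent names, channel names, hostname and port below are assumptions made for illustration.

    # First-tier agent: its Avro sink points at the collector's host and port
    agent1.sinks.avroSink.type = avro
    agent1.sinks.avroSink.hostname = collector.example.com
    agent1.sinks.avroSink.port = 4545
    agent1.sinks.avroSink.channel = ch1

    # Second-tier agent: its Avro source listens on the same port
    agent2.sources.avroSrc.type = avro
    agent2.sources.avroSrc.bind = 0.0.0.0
    agent2.sources.avroSrc.port = 4545
    agent2.sources.avroSrc.channels = ch2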
  • 140. Consolidation • A very common scenario in log collection is a large number of log-producing clients sending data to a few consumer agents that are attached to the storage subsystem • For example, logs collected from hundreds of web servers sent to a dozen agents that write to the HDFS cluster • This can be achieved in Flume by configuring a number of first-tier agents with an Avro sink, all pointing to an Avro source of a single agent (Thrift sources / sinks / clients can be used in such a scenario as well) • This source on the second-tier agent consolidates the received events into a single channel, which is consumed by a sink to its final destination
  • 142. Multiplexing Flow • Flume supports multiplexing the event flow to one or more destinations • This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels • For the multiplexing case, an event is delivered to a subset of available channels when an event's attribute matches a preconfigured value • For example, if an event attribute called "txnType" is set to "customer", the event should go to channel1 and channel3; if it is "vendor", it should go to channel2; otherwise it goes to channel3 • The mapping can be set in the agent's configuration file
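A sketch of how the txnType routing described above could be expressed with a multiplexing channel selector; the agent, source and channel names are assumptions.

    a1.sources.r1.channels = channel1 channel2 channel3
    a1.sources.r1.selector.type = multiplexing
    a1.sources.r1.selector.header = txnType
    # Events with txnType=customer go to channel1 and channel3
    a1.sources.r1.selector.mapping.customer = channel1 channel3
    # Events with txnType=vendor go to channel2
    a1.sources.r1.selector.mapping.vendor = channel2
    # Everything else falls through to channel3
    a1.sources.r1.selector.default = channel3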
  • 145. Apache Sqoop • An open-source tool to extract data from a relational database into HDFS or HBase • Connectors are available for MySQL, PostgreSQL, Oracle, SQL Server and DB2 • A single client program that creates one or more MapReduce jobs to perform its tasks • By default 4 map tasks are used in parallel • Sqoop does not have any server processes • If we assume a table with 1 million records and four mappers, then each will process 250,000 records
  • 146. Apache Sqoop • With its knowledge of the primary key column, Sqoop can create four SQL statements to retrieve the data, each using a different range of the primary key column in its WHERE clause • In the simplest case, this could be as straightforward as adding something like WHERE id BETWEEN 1 AND 250000 to the first statement and using different id ranges for the others • In addition to writing the contents of the database table to HDFS, Sqoop also provides a generated Java source file (widgets.java) written to the current local directory • Sqoop uses this generated code to handle the deserialization of table-specific data from the database source before writing it to HDFS
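As a sketch, an import of the kind described above (the widgets table matching the generated widgets.java) might be launched as follows; the JDBC URL, credentials and target directory are hypothetical, and --split-by names the primary key column whose range is divided across the four mappers.

    $ sqoop import --connect jdbc:mysql://dbhost/sales --username dbuser -P \
          --table widgets --split-by id -m 4 \
          --target-dir /data/widgets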
  • 149. Commands • codegen - Generate code to interact with database records • create-hive-table - Import a table definition into Hive • eval - Evaluate a SQL statement and display the results • export - Export an HDFS directory to a database table • help - List available commands • import - Import a table from a database to HDFS • import-all-tables - Import tables from a database to HDFS • job - Work with saved jobs • list-databases - List available databases on a server • list-tables - List available tables in a database • merge - Merge results of incremental imports • metastore - Run a standalone Sqoop metastore • version - Display version information
  • 150. Importing data into Hive using Sqoop • Sqoop has significant integration with Hive, allowing it to import data from a relational source into either new or existing Hive tables
$ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest --username hadoopuser -P \
      --table employees --hive-import --hive-table employees
  • 151. Export • An export uses HDFS as the source of data and a remote database as the destination
% sqoop export --connect jdbc:mysql://localhost/hadoopguide -m 1 \
> --table sales_by_zip --export-dir /user/hive/warehouse/zip_profits \
> --input-fields-terminated-by '\0001'
...
10/07/02 16:16:50 INFO mapreduce.ExportJobBase: Transferred 41 bytes in 10.8947 seconds (3.7633 bytes/sec)
10/07/02 16:16:50 INFO mapreduce.ExportJobBase: Exported 3 records.
  • 152. Export
  • 154. Apache ZooKeeper • A set of tools to build distributed applications that can safely handle partial failures • A rich set of building blocks to build a large class of coordination data structures and protocols like distributed queues, distributed locks, and leader election among a group of peers • Runs on a collection of machines for high availability • Avoids single points of failure for reliability • Facilitates loosely coupled interactions so that participants do not need to know about one another • An open source, shared repository of implementations and recipes of common coordination patterns • Provides built-in services such as naming, configuration management, locks and synchronization, and group services, offering high-performance coordination for distributed applications
  • 155. Apache Zookeeper • Written in Java • Strongly consistent • Ensemble of Servers • In-memory data • Datasets must fit in memory • Shared hierarchical namespace • Access Control list for each node • Similar to a file system
  • 157. ZooKeeper Service • All servers store a copy of the data in memory • A leader is elected at startup • Followers respond to client requests • All updates go through the leader • Responses are sent once a majority of servers have persisted the change
  • 160. Znodes • A unified concept of a node called a znode • Acts both as a container of data (like a file) and a container of other znodes (like a directory) • Form a hierarchical namespace • Two types - ephemeral or persistent. Set at creation time and not changed later • To build a membership list, create a parent znode with the name of the group and child znodes with the name of the group members (servers)
  • 161. Znodes • Referenced by paths, which are represented as slash-delimited Unicode character strings, like file system paths in Unix • Paths must be absolute, so they must begin with a slash character • Paths are canonical hence each path has a single representation
  • 162. API • create - Creates a znode • delete - Deletes a znode (must not have any children) • exists - Tests if a znode exists and retrieves its metadata • getACL, setACL - Gets/sets the ACL for a znode • getChildren - Gets a list of the children of a znode • getData, setData - Gets/sets data associated with a znode • sync - Synchronizes a client’s view of a znode with ZooKeeper
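A minimal Java sketch exercising these operations; the ensemble address zk1:2181,zk2:2181,zk3:2181 and the /config znode are assumptions made for illustration, not something defined in the slides.

import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class ZkApiSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a hypothetical ensemble; the lambda is the default watcher
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {});

        // create - a persistent znode holding a small payload
        zk.create("/config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists / getData - read the znode and its metadata back
        Stat stat = zk.exists("/config", false);
        byte[] data = zk.getData("/config", false, stat);
        System.out.println("Data: " + new String(data) + ", version " + stat.getVersion());

        // setData - version -1 means "update regardless of current version"
        zk.setData("/config", "v2".getBytes(), -1);

        // getChildren - list the children of the root znode
        System.out.println("Children of /: " + zk.getChildren("/", false));

        // delete - remove the znode (it must have no children)
        zk.delete("/config", -1);
        zk.close();
    }
}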
  • 163. Ephemeral Nodes • Deleted when the creating client’s session ends • May not have children, not even ephemeral ones. • Even though tied to a client session, they are visible to all clients subject to their ACL policy • Ideal for building applications that need to know when certain distributed resources are available
  • 164. Ephemeral Nodes • Example - a group membership service that allows any process to discover the members of the group at any particular time • A persistent znode is not tied to the client’s session and is deleted only when explicitly deleted by any client
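A sketch of the group-membership pattern using ephemeral znodes; the ensemble address, the /servers group path and the member name passed as a command-line argument are all assumptions.

import org.apache.zookeeper.*;

public class GroupMember {
    public static void main(String[] args) throws Exception {
        // Hypothetical ensemble and group path; the member name is passed as args[0]
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, e -> {});

        // The group parent is a persistent znode; create it once if missing
        if (zk.exists("/servers", false) == null) {
            zk.create("/servers", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Each member registers as an ephemeral child; the znode is removed
        // automatically when this client's session ends
        zk.create("/servers/" + args[0], new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Any client can discover the current membership
        System.out.println("Members: " + zk.getChildren("/servers", false));

        // Keep the process (and hence the session and the membership) alive
        Thread.sleep(Long.MAX_VALUE);
    }
}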
  • 165. Sequence Nodes • A sequential znode has a sequence number • A znode created with the sequential flag set has the value of a monotonically increasing 10-digit counter, maintained by the parent znode, appended to its name • If a client asks to create a sequential znode with the name /a/b-, for example, then the znode created may have the name /a/b-3 (zero-padded to 10 digits in practice, i.e. /a/b-0000000003)
  • 166. Sequence Nodes • Another new sequential znode with the name /a/b will have a unique name with a larger value of the counter - for example, /a/b-5 • Sequence numbers can be used to impose a global ordering on events in a distributed system, and may be used by the client to infer the ordering
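A small Java sketch of sequential naming, assuming a hypothetical ensemble address and that the parent znode /a already exists.

import org.apache.zookeeper.*;

public class SequentialDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, e -> {});
        // The server appends a zero-padded, monotonically increasing counter,
        // so repeated calls yield names like /a/b-0000000003, /a/b-0000000005, ...
        String created = zk.create("/a/b-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("Created " + created);
        zk.close();
    }
}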
  • 167. Watches • Allow clients to get notifications when a znode changes (data or children) • Work like a one-shot callback mechanism triggered when connection or znode state changes • A watch set on an exists operation will be triggered when the znode being watched is created, deleted, or has its data updated • A watch set on a getData operation will be triggered when the znode being watched is deleted or has its data updated
  • 168. Watches • A watch set on a getChildren operation will be triggered when a child of the znode being watched is created or deleted, or when the znode itself is deleted • Triggered only once • To receive multiple notifications, a client needs to reregister the watch • If a client wishes to receive further notifications for the znode’s existence (to be notified when it is deleted, for example), it needs to call the exists operation again to set a new watch
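A Java sketch of the re-registration pattern for an exists watch; the ensemble address and the /config path are assumptions for illustration.

import org.apache.zookeeper.*;

public class ExistsWatch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, e -> {});
        watchNode(zk, "/config");              // hypothetical znode to watch
        Thread.sleep(Long.MAX_VALUE);          // keep the session open
    }

    // A watch fires only once, so the callback sets a new one by calling exists() again
    static void watchNode(ZooKeeper zk, String path) throws Exception {
        zk.exists(path, event -> {
            System.out.println(path + " changed: " + event.getType());
            try {
                watchNode(zk, path);           // re-register for further notifications
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        });
    }
}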
  • 169. High Availability Mechanism • For resilience, ZooKeeper runs in replicated mode on a cluster of machines called an ensemble • Achieves high availability through replication and can provide a service as long as a majority of the machines in the ensemble are up • ZooKeeper uses a protocol called Zab that runs in two phases and is repeated indefinitely
  • 170. High Availability Mechanism • Phase 1: Leader election – The machines in an ensemble go through a process of electing a distinguished member, called the leader – The other machines are termed followers. – This phase is finished once a majority (or quorum) of followers have synchronized their state with the leader
  • 171. High Availability Mechanism • Phase 2: Atomic broadcast – All write requests are forwarded to the leader, which broadcasts the update to the followers – When a majority have persisted the change, the leader commits the update, and the client gets a response saying the update succeeded – The protocol for achieving consensus is designed to be atomic, so a change either succeeds or fails. It resembles a two-phase commit
  • 172. References
1. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI 2004 (Google)
2. D. Cutting and E. Baldeschwieler, "Meet Hadoop," OSCON, Portland, OR, USA, 25 July 2007 (Yahoo!)
3. R. E. Bryant, "Data Intensive Scalable Computing: The Case for DISC," Tech Report CMU-CS-07-128
4. A. Thusoo et al., "Hive: A Warehousing Solution over a Map-Reduce Framework," Proceedings of VLDB '09, 2009
5. http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
6. http://flume.apache.org
7. http://incubator.apache.org/sqoop/
8. Roman, Javi. "The Hadoop Ecosystem Table". github.com
9. Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. John Wiley & Sons
10. "Refactor the scheduler out of the JobTracker". Hadoop Common. Apache Software Foundation
11. Jones, M. Tim (6 December 2011). "Scheduling in Hadoop". ibm.com. IBM
12. "Hadoop and Distributed Computing at Yahoo!". Yahoo!
13. "HDFS: Facebook has the world's largest Hadoop cluster!". Hadoopblog.blogspot.com
14. "Under the Hood: Hadoop Distributed File System Reliability with Namenode and Avatarnode". Facebook
  • 173. Thank You Check Out My LinkedIn Profile at https://in.linkedin.com/in/girishkhanzode
