Distributed Big Data
Processing with
Hadoop Platform
Girish Khanzode
Contents
What is Hadoop?
• An Apache top level project, open-source implementation of frameworks
for reliable, scalable, distributed computing and data storage.
• A flexible, highly-available architecture for large scale computation and
data processing on a network of commodity hardware.
• Implementation of Google’s MapReduce, using HDFS
Hadoop Goals
• Facilitate the storage and processing of large and/or rapidly growing data
sets, primarily non-structured in nature
• Simple programming models
• High scalability and availability
• Fault-tolerance
• Move computation rather than data
• Use commodity (cheap!) hardware with little redundancy
• Provide cluster based computing
Map Reduce Patent
Google granted US Patent 7,650,331, January 2010 - System and method for efficient large-scale data processing
________________________________________________________________________________________________
A large-scale data processing system and method includes one or more application-independent map modules configured
to read input data and to apply at least one application-specific map operation to the input data to produce
intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the
parallel processing environment. A plurality of intermediate data structures are used to store the intermediate data
values. One or more application-independent reduce modules are configured to retrieve the intermediate data values
and to apply at least one application-specific reduce operation to the intermediate data values to provide output data.
Platform Assumptions
• Hardware will fail
• Processing will be run in batches. Thus there is an emphasis on high throughput as opposed to low
latency
• Applications that run on HDFS have large data sets
• A typical file in HDFS is gigabytes / terabytes in size
• It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster
• It should support tens of millions of files in a single instance
• Applications need a write-once-read-many access model
• Moving Computation is cheaper than moving data
• Portability is important
Components
• Map Reduce Layer
– JobTracker (master) - coordinates the execution of jobs
– TaskTrackers (slaves) - control the execution of map and reduce tasks on the
machines that do the processing
• HDFS Layer
– Stores files
– NameNode (master) - manages the file system, keeps metadata for all the files
and directories in the tree
– DataNodes (slaves) - workhorses of the file system; store and retrieve blocks
when they are told to (by clients or the NameNode) and report back to the
NameNode periodically
Architecture – Multi-Node Cluster
HDFS
• Hadoop Distributed File System
• Designed to run on commodity hardware
• Part of Apache Hadoop Core project http://hadoop.apache.org/core/
• Highly fault-tolerant
• Designed for deployment on low-cost hardware
• Provides high throughput access to application data and is suitable for
applications that have large data sets.
– Write-once-read-many access model
• Relaxes a few POSIX requirements to enable streaming access to file system data
HDFS Architecture
HDFS - Key Points
• Files are broken into large blocks.
– Typically 128 MB block size
– Blocks are replicated on multiple DataNode for reliability
• Understands rack locality
• One replica on the local node, another replica on a remote rack, a third replica on the local rack;
additional replicas are placed randomly
• Data placement exposed so that computation can be migrated to data
• Client talks to both the NameNode and DataNodes
• Data is not sent through the NameNode; clients access data directly from DataNodes (see the client-side sketch below)
• Throughput of file system scales nearly linearly with the number of nodes.
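A minimal client-side sketch of this interaction using the Java FileSystem API (the path and cluster address are illustrative, not from the deck): open() asks the NameNode where the blocks live, and the returned stream then streams block data straight from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS is expected to point at the NameNode, e.g. hdfs://namenode:8020 (assumed)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode for block locations; the returned stream
        // then reads the block data directly from the DataNodes
        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {  // hypothetical path
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}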
NameNode
• DFS Master
– Manages the file system namespace
– Controls read/write access to files
– Manages block replication
– Checkpoints namespace and journals namespace changes for reliability
• Metadata of Name node in Memory
– The entire metadata is in main memory
– No demand paging of FS metadata
• Types of Metadata:
– List of files, file and chunk namespaces; list of blocks, location of replicas; file attributes etc.
DataNodes
• Serve read/write requests from clients
• Perform replication tasks upon instruction from the NameNode
• Store data in the local file system
• Store metadata of each block (e.g. CRC)
• Serve data and metadata to clients
• Periodically send a report of all existing blocks to the NameNode
• Periodically send heartbeats to the NameNode (to help detect node failures)
• Facilitate pipelining of data (to other specified DataNodes)
DataNodes
HDFS High Availability
• Option of running two redundant NameNodes in the
same cluster
• Active/Passive configuration with a hot standby
• Fast fail-over to a new NameNode if a machine crashes
• Graceful administrator-initiated fail-over for planned
maintenance
HDFS High Availability
NameNode Failure
• Prior to Hadoop 2.0, the NameNode
was a single point of failure (SPOF) in
an HDFS cluster
• Secondary NameNode
– Not a standby for the NameNode
– Connects to the NameNode every hour
– Performs housekeeping and backup of
NameNode metadata
– Saved metadata can be used to rebuild a failed
NameNode
DataNode Failure
• Each DataNode periodically sends a Heartbeat message to
the NameNode
• If the NameNode does not receive a heartbeat from a
particular DataNode for 10 minutes, then it considers that
data node to be dead/out of service.
• NameNode initiates replication of blocks hosted on that data
node to some other data node
DataNode Failure
MapReduce Framework
• Programming model developed at Google
• Sort/merge based distributed computing
• Automatic parallel execution & distribution
• Fault tolerant
• Functional style programming for parallelism across a large cluster of
nodes
• Works like a parallel Unix pipeline:
– cat input | grep | sort | uniq -c | cat > output
– Input | Map | Shuffle & Sort | Reduce | Output
MapReduce Framework
• Underlying system takes care of
– partitioning of the input data
– scheduling the program’s execution across several machines
– handling machine failures
– managing inter-machine communication
• Provides inter-node communication
– Failure recovery, consistency etc
– Load balancing, scalability etc
• Suitable for batch processing applications
– Log processing
– Web index building
MapReduce Framework
What is MapReduce Used For?
• At Google:
– Index building for Google Search
– Article clustering for Google News
– Statistical machine translation
• At Yahoo!:
– Index building for Yahoo! Search
– Spam detection for Yahoo! Mail
• At Facebook:
– Data mining
– Ad optimization
– Spam detection
MapReduce Components
• JobTracker
– Map/Reduce Master
– Accepts MR jobs submitted by users
– Assigns Map and Reduce tasks to TaskTrackers
– Monitors task and TaskTracker statuses, Re-executes tasks upon failure
• TaskTrackers
– Map/Reduce Slaves
– Run Map and Reduce tasks upon instruction from the JobTracker
– Manage storage and transmission of intermediate output
MapReduce Components
Distributed Execution
[Figure: the user program forks a master and several workers; the master assigns map tasks over the input splits (Split 0-2) and assigns reduce tasks; map workers read their splits and write intermediate data to local disk; reduce workers perform remote reads and sort, then write the output files (Output File 0, Output File 1)]
Working of MapReduce
• The run time partitions the input and provides it to different Map
instances
• Map (k1, v1) -> (k2, v2)
• The run time collects the (k2, v2) pairs and distributes them to several
Reduce functions so that each Reduce function gets the pairs with the
same k2
• Each Reduce produces a single (or zero) file output
• Map and Reduce are user written functions
Input and Output
• MapReduce operates exclusively on <key, value> pairs
• Job Input: <key, value> pairs
• Job Output: <key, value> pairs
• Key and value can be different types, but must be serializable
by the framework.
Input <k1, v1> -> map -> <k2, v2> -> reduce -> Output <k3, v3>
Example - Counting Words
• Given a large collection of documents, output the
frequency for each unique word
• After putting this data into HDFS, Hadoop
automatically splits into blocks and replicates each
block
Input Reader
• Input reader reads a block and divides into splits
• Each split would be sent to a map function
– a line is an input of a map function
• The key could be some internal number (filename-blockid-lineid), the value is the
content of the textual line.
Block 1:
Apple Orange Mongo
Orange Grapes Plum
Block 2:
Apple Plum Mongo
Apple Apple Plum
[Figure: the input reader turns each line of the two blocks into the input of one map function]
Mapper - Map Function
• Mapper takes the output generated by the input reader and outputs a list of intermediate <key, value> pairs.
Mapper
m1: Apple Orange Mongo -> (Apple, 1), (Orange, 1), (Mongo, 1)
m2: Orange Grapes Plum -> (Orange, 1), (Grapes, 1), (Plum, 1)
m3: Apple Plum Mongo -> (Apple, 1), (Plum, 1), (Mongo, 1)
m4: Apple Apple Plum -> (Apple, 1), (Apple, 1), (Plum, 1)
Reducer - Reduce Function
• Reducer takes the output generated by the Mapper,
aggregates the value for each key, and outputs the final
result
• There is shuffle/sort before reducing.
Reducer - Reduce Function
The mapper output (from m1-m4) is shuffled and sorted so that all pairs with the same key reach the same reducer:
r1: (Apple, 1) x 4 -> (Apple, 4)
r2: (Orange, 1) x 2 -> (Orange, 2)
r3: (Grapes, 1) x 1 -> (Grapes, 1)
r4: (Mongo, 1) x 2 -> (Mongo, 2)
r5: (Plum, 1) x 3 -> (Plum, 3)
Reducer - Reduce Function
• The same key MUST go to the same reducer
• Different keys CAN go to the same reducer.
All (Orange, 1) pairs go to the same reducer r2, which outputs (Orange, 2).
Different keys can share a reducer: the (Orange, 1) pairs and (Grapes, 1) can both go to r2, which then outputs (Orange, 2) and (Grapes, 1).
Combiner
• When the map operation outputs its pairs they are already available in memory
• For efficiency reasons, sometimes it makes sense to take advantage of this fact by supplying
a combiner class to perform a reduce-type function
• If a combiner is used then the map key-value pairs are not immediately written to the output
• Instead they will be collected in lists, one list per each key value. (optional)
(Apple, 1), (Apple, 1), (Plum, 1) -> combiner -> (Apple, 2), (Plum, 1)
Partitioner - Partition function
• When a mapper emits a key value pair, it has to be sent to one of the
reducers - Which one?
• The mechanism sending specific key-value pairs to specific reducers is
called partitioning (the key-value pairs space is partitioned among the
reducers)
• In Hadoop, the default partitioner is Hash-Partitioner, which hashes a
record’s key to determine which partition (and thus which reducer) the
record belongs in
• The number of partitions is equal to the number of reduce tasks for
the job
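A hedged sketch of a custom partitioner in the Java MapReduce API; this one simply mirrors the default hash-mod-R behaviour, and the class name is an illustration rather than anything from the deck.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Same idea as the default Hash-Partitioner: hash the key and take it
        // modulo the number of reduce tasks; the mask keeps the value non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered on the job with job.setPartitionerClass(WordPartitioner.class); a skew-aware variant would replace the hash with domain-specific logic.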
Importance of Partition
• It has a direct impact on overall performance of the job
• A poorly designed partitioning function will not distribute the load evenly
over the reducers, losing much of the benefit of the distributed
map/reduce infrastructure
• It may sometimes be necessary to control how the key/value pairs are
partitioned over the reducers
Importance of Partition
• Example - a job's input is a huge set of tokens with their numbers of occurrences,
and you want to sort them by number of occurrences
[Figure: the resulting reducer load without any customized partitioner vs. with a customized partitioner]
Example - Word Count
• map(String key, String value):
// key: document name; value: document contents; map (k1,v1) -> list(k2,v2)
for each word w in value: EmitIntermediate(w, "1");
(If the input string is ("abc def ghi abc mno pqr"), Map produces {<"abc", 1>, <"def", 1>, <"ghi", 1>, <"abc", 1>,
<"mno", 1>, <"pqr", 1>})
• reduce(String key, Iterator values):
// key: a word; values: a list of counts; reduce (k2,list(v2)) -> list(v2)
int result = 0;
for each v in values:
result += ParseInt(v);
Emit(AsString(result));
(Example: reduce(“abc”, <1,1>) -> 2)
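The same word count expressed against the Java org.apache.hadoop.mapreduce API - a minimal sketch, with class names chosen for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(k1, v1) -> list(k2, v2): the input key is the byte offset of the line,
// the value is the line itself; emit (word, 1) for every word
public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// reduce(k2, list(v2)) -> (k2, sum): add up the counts for each word
class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}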
Physical Flow
Simplified MapReduce
[Figure: on each machine (Machine 1, Machine 2), input pairs <k1, v1> ... <k6, v6> pass through a local Map producing <nk, nv> pairs, a local Sort that groups them by key (nk1, nk2, nk3), and a local Reduce that emits counts such as <nk2, 3>, <nk1, 2>, <nk3, 1>]
JobTracker Failure
• If the master task dies, a new copy can be started from the
last check-pointed state. However, in most cases, the user
restarts the job
• After restarting JobTracker all the jobs running at the time of
the failure should be resubmitted
Handling TaskTracker Failure
• The JobTracker pings every worker periodically
• If no response is received from a worker in a certain amount of time, the
master marks the worker as failed
• Any map tasks completed by the worker are reset back to their initial idle
state, and therefore become eligible for scheduling on other workers.
• Any map task or reduce task in progress on a failed worker is also reset to
idle and becomes eligible for rescheduling.
• Task tracker will stop sending the heartbeat to the Job Tracker
Handling TaskTracker Failure
• The JobTracker notices this failure
– It has not received a heartbeat for 10 minutes
– This interval can be configured via the mapred.tasktracker.expiry.interval property
• The JobTracker removes this TaskTracker from the task pool
• Completed map tasks are re-run as well, because their intermediate output
resides on the failed TaskTracker's local file system, which is not accessible
to the reduce tasks
Handling TaskTracker Failure
Data flow
• Input, final output are stored on a distributed file system
– Scheduler tries to schedule map tasks “close” to physical storage
location of input data
• Intermediate results are stored on local FS of map and reduce
workers
• Output is often input to another map reduce task
Coordination
• Master data structures
– Task status: (idle, in-progress, completed)
– Idle tasks get scheduled as workers become available
– When a map task completes, it sends the master the location and
sizes of its R intermediate files, one for each reducer
– Master pushes this info to reducers
• Master pings workers periodically to detect failures
Failures
• Map worker failure
– Map tasks completed or in-progress at worker are reset to idle
– Reduce workers are notified when task is rescheduled on another worker
• Reduce worker failure
– Only in-progress tasks are reset to idle
• Master failure
– MapReduce task is aborted and client is notified
How many Map and Reduce jobs?
• M - map tasks, R - reduce tasks
• Rule of thumb
– Make M and R much larger than the number of nodes in cluster
– One DFS chunk per map is common
– Improves dynamic load balancing and speeds recovery from worker failure
• Usually R is smaller than M because output is spread across R files
Mapping Workers to Processors
• MapReduce master takes the location information of the input files and
schedules a map task on a machine that contains a replica of the
corresponding input data
• Failing that, it attempts to schedule a map task near a replica of that task's
input data
• When running large MapReduce operations on a significant fraction of
the workers in a cluster, most input data is read locally and consumes no
network bandwidth
Combiner Function
• User can specify a Combiner function that does partial merging of the intermediate local
disk data before it is sent over the network.
• The Combiner function is executed on each machine that performs a map task
• Typically the same code is used to implement both the combiner and the reduce functions
• Often a map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
– popular words in Word Count
• Can save network time by pre-aggregating at mapper
– combine(k1, list(v1)) -> v2
– Usually same as reduce function
• Works only if reduce function is commutative and associative
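A hedged sketch of a driver that wires these pieces together, reusing the reducer as the combiner (safe here because summing counts is commutative and associative); the mapper, reducer and partitioner class names refer to the illustrative sketches earlier in this deck, not to anything shipped with Hadoop.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // partial sums on the map side
        job.setReducerClass(IntSumReducer.class);
        // job.setPartitionerClass(WordPartitioner.class); // optional custom partitioner
        job.setNumReduceTasks(4);                    // R, the number of output files

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}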
Partition Function
• The users of MapReduce specify the number of reduce tasks/output files that they
desire (R)
• Data gets partitioned across these tasks using a partitioning function on the
intermediate key
• A default partitioning function is provided that uses hashing (hash(key) mod R)
• In some cases, it may be useful to partition data by some other function of the
key. The user of the MapReduce library can provide a special partitioning
function.
Task Granularity
• The map phase has M pieces and the reduce phase has R pieces
• M and R should be much larger than the number of worker machines
• Having each worker perform many different tasks improves dynamic load
balancing and also speeds up recovery when a worker fails
• The larger M and R are, the more decisions the master must make
• R is often constrained by users because the output of each reduce task ends up in
a separate output file
• Typically - at Google, M = 200,000 and R = 5,000, using 2,000 worker machines
Job Execution View
Data Flow
Execution Summary
• Distributed Processing
– Partition input key/value pairs into chunks, run map() tasks in parallel
– After all map()s are complete, consolidate all emitted values for each
unique emitted key
– Now partition space of output map keys, and run reduce() in parallel
• If map() or reduce() fails -> re-execute
MapReduce – Data Flow
• Input reader – divides input into appropriate size splits which get assigned to a Map
function.
• Map function – maps file data/split to smaller, intermediate <key, value> pairs.
• Partition function – finds the correct reducer: given the key and number of reducers, returns
the desired reducer node. (optional)
• Compare function – input from the Map intermediate output is sorted according to the
compare function. (optional)
• Reduce function – takes intermediate values and reduces to a smaller solution handed back
to the framework.
• Output writer – writes file output.
Execution Overview
• The MapReduce library in user program splits input files into M pieces of typically 16 MB to 64 MB/piece
• It then starts up many copies of the program on a cluster of machines
• One of the copies of the program is the master
• The rest are workers that are assigned work by the master
• There are M map tasks and R reduce tasks to assign
• The master picks idle workers and assigns each one a map task or a reduce task
• A worker who is assigned a map task reads the contents of the assigned input split
• It parses key/value pairs out of the input data and passes each pair to the user-defined Map function
• The intermediate key/value pairs produced by the Map function are buffered in memory and
periodically written to local disk, partitioned into R regions
• The locations of these buffered pairs on the local disk are passed back to the master, who forwards these
locations to the reduce workers
Execution Overview
• When a reduce worker is notified by the master about these locations, it uses remote
procedure calls (RPC) to read the buffered data from the local disks of the map workers
• When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all
occurrences of the same key are grouped together
• The reduce worker iterates over the sorted intermediate data and for each unique intermediate
key encountered, it passes the key and the corresponding set of intermediate values to the user's
Reduce function
• The output of the Reduce function is appended to a final output file for this reduce partition
• When all map tasks and reduce tasks have been completed, the master wakes up the user
program - the MapReduce call in the user program returns back to the user code
• The output of the mapreduce execution is available in the R output files (one per reduce task)
MapReduce Advantages
• Distribution is completely transparent
– Not a single line of distributed programming (ease, correctness)
• Automatic fault-tolerance
– Determinism enables running failed tasks somewhere else again
– Saved intermediate data enables just re-running failed reducers
• Automatic scaling
– As operations are side-effect free, they can be distributed to any number of machines dynamically
• Automatic load-balancing
– Move tasks and speculatively execute duplicate copies of slow tasks (stragglers)
APACHE HIVE
Need for High-Level Languages
• Hadoop is great for large-data processing
– But writing Java programs for everything is verbose and slow
– Not everyone wants to (or can) write Java code
• Solution: develop higher-level data processing languages
– Hive - HQL is like SQL
– Pig - Pig Latin is a bit like Perl
• Hive - data warehousing application in Hadoop
– Query language is HQL, variant of SQL
– Tables stored on HDFS as flat files
Need for High-Level Languages
• Pig - large-scale data processing system
– Scripts are written in Pig Latin, a dataflow language
– Developed by Yahoo, now open source
• Common idea
– Provide higher-level language to facilitate large-data processing
– Higher-level language compiles down to Hadoop jobs
Hive - Background
• Started at Facebook
• Data was collected by nightly cron jobs into Oracle DB
• ETL via hand-coded python
• Grew from 10s of GBs (2006) to 1 TB/day new data (2007), now 10x that
• A data warehouse system to facilitate easy data summarization, ad-hoc
queries and the analysis of large datasets stored in Hadoop compatible
file systems
• Supports Hive Query Language (HQL) statements similar to SQL
statements
Source: cc-licensed slide by Cloudera
Hive
• HiveQL is a subset of SQL covering the most common statements
• HQL statements are broken down by the Hive service into MapReduce jobs and executed
across a Hadoop cluster
• JDBC/ODBC support
• Follows schema-on-read design – very fast initial load
• Agile data types: Array, Map, Struct, and JSON objects
• User Defined Functions and Aggregates
• Regular Expression support
• Partitions and Buckets (for performance optimization)
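Because Hive exposes a JDBC driver, HQL can be submitted from plain Java - a minimal sketch assuming a reachable HiveServer2 instance and the Hive JDBC driver on the classpath (host, credentials and the exact query are illustrative assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 typically listens on port 10000; host, database and user are assumed
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hadoopuser", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this HQL into one or more MapReduce jobs
             ResultSet rs = stmt.executeQuery(
                     "SELECT pageid, COUNT(1) FROM page_view GROUP BY pageid")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}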
Hive Components
• Shell: allows interactive queries
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (MR, HDFS, metadata)
• Metastore: schema, location in HDFS, SerDe
Source: cc-licensed slide by Cloudera
Hive Components
Architecture
Data Model
• Basic column types (int, float, boolean)
• Complex types: List / Map (associative array), Struct
CREATE TABLE complex (
col1 ARRAY<INT>,
col2 MAP<STRING, INT>,
col3 STRUCT<a:STRING, b:INT, c:DOUBLE>
);
• Built-in functions – mathematical, statistical, string, date, conditional
functions, aggregate functions and functions for working with XML and
JSON
Data Model
• Tables
– Typed columns (int, float, string, boolean)
– list: map (for JSON-like data)
• Partitions
– For example, range-partition tables by date
• Buckets
– Hash partitions within ranges
– useful for sampling, join optimization
Source: cc-licensed slide by Cloudera
Metastore
• Database: namespace containing a set of tables
• Holds table definitions (column types, physical layout)
• Holds partitioning information
• Can be stored in Derby, MySQL, and many other relational
databases
Source: cc-licensed slide by Cloudera
Physical Layout
• Warehouse directory in HDFS
– /user/hive/warehouse
• Tables stored in subdirectories of warehouse
– Partitions form subdirectories of tables
• Actual data stored in flat files
– Control char-delimited text or SequenceFiles
– With custom SerDe, can use arbitrary format
Source: cc-licensed slide by Cloudera
Metadata
• Data organized into tables
• Metadata like table schemas stored in the database metastore
• The metastore is the central repository of Hive metadata
• Metastore runs in the same process as the Hive service
• Loading data into a Hive table copies the data file into Hive's warehouse
directory; the input data is not parsed into rows at load time
• HiveQL queries use metadata for query execution
Tables
• Logically made up of the data being stored and the associated metadata
describing the layout of the data in the table.
• The data can reside in HDFS or a similar system, or in S3
• Hive stores the metadata in a relational database and not in HDFS
• When a table is created, Hive moves the data into its warehouse directory
• External table – Hive refers data outside the warehouse directory
Partitioning
• Hive organizes tables into partitions by dividing a table into coarse-grained parts based on
the value of a partition column, such as date
• Using partitions makes queries faster on slices of the data
• Log files with each record containing a timestamp - If partitioned by date, records for the
same date would be stored in the same partition
• Queries restricted to a particular date or set of dates are more efficient since only required
files are scanned
• Partitioning on multiple dimensions allowed.
• Defined at table creation time
• Separate subdirectory for each partition
Bucketing
• Partitions further organized in buckets for more efficient queries
• The CLUSTERED BY clause is used to create buckets on the specified column
• Data within a bucket can be additionally sorted by one or more columns
UDF
• A UDF operates on a single row and produces a single row as its output. Most
functions, such as mathematical functions and string functions, are of this
type
• A UDAF (user-defined aggregate function) works on multiple input rows
and creates a single output row, e.g. COUNT and MAX
• A UDTF (user-defined table-generating function) operates on a single
row and produces multiple rows - a table - as output
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
[Figure: page_view joined with user produces pv_users]
Hive QL – Join
[Figure: the join executed as MapReduce - page_view and user are mapped, shuffled and sorted on the join key, and reduced to produce pv_users]
Hive QL – Join in Map Reduce
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
[Figure: pv_users aggregated into pageid_age_sum]
Hive QL – Group By
[Figure: the GROUP BY executed as MapReduce - pv_users is mapped, shuffled and sorted on (pageid, age), and reduced into pageid_age_sum]
Hive QL – Group By in Map Reduce
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view GROUP BY pageid
[Figure: page_view reduced to the distinct-count result]
Hive QL – Group By with Distinct
[Figure: page_view is shuffled and sorted, then reduced; the shuffle key is a prefix of the sort key]
Hive QL – Group By with Distinct in Map Reduce
[Figure: ORDER BY - page_view is shuffled randomly, then sorted and reduced]
Hive QL - Order By
Hive Benefits
• An easy way to process large scale data
• Supports SQL-based queries
• Provides more user-defined interfaces to extend
programmability
• Efficient execution plans for performance
• Interoperability with other database tools
APACHE PIG
Apache Pig
• Framework to analyze large un-structured and semi-structured data on
top of Hadoop
• Consists of a high-level language for expressing data analysis programs,
coupled with infrastructure
• Compiles down to MapReduce jobs
• Infrastructure layer consists of
– a compiler to create sequences of Map-Reduce programs
– language layer consists of a textual language called Pig Latin
Apache Pig
Pig Latin
• A scripting language to explore large datasets
• Easy to achieve parallel execution of simple data analysis tasks
• Complex tasks with multiple interrelated data transformations explicitly
encoded
• Automatic optimization
• Create own functions for special-purpose processing
• A script can map to multiple map-reduce jobs
Benefits
• Faster development
– Fewer lines of code (writing Pig Latin is closer to writing SQL queries than to writing MapReduce in Java)
– Re-use the code (Pig library, PiggyBank)
• Conduct a test: find the top 5 words with the highest frequency
– Pig Latin needed 10 lines of code as against 200 lines in Java
– Pig execution time was 15 minutes as against 4 hours in Java
Language Features
• A Pig Latin program is made up of a series of transformations applied to the input data to produce output
• A dataflow language that provides the high-level language interface for Hadoop
• The Pig engine parses and compiles Pig Latin scripts into MapReduce jobs that run on top of Hadoop
• Keywords - LOAD, FILTER, FOREACH ... GENERATE, GROUP BY, STORE, JOIN, DISTINCT, ORDER BY
• Aggregations - Count, Avg, Sum, Max, Min
• Schema - Defined at query-time not when files are loaded
• UDFs
• Packages for common input/output formats
Language Features
• Virtually all parts of the processing path are customizable: loading, storing,
filtering, grouping, and joining can all be altered by UDFs
• Writing load and store functions is easy once an InputFormat and OutputFormat
exist
• Multi-query: pig combines certain types of operations together in a single pipeline
to reduce the number of times data is scanned.
• Order by provides total ordering across reducers in a balanced way
• Piggybank is a repository of UDF Pig functions shared by the Pig community
Data Types
• Scalar Types - Int, long, float, double, Boolean, null,
chararray, bytearray
• Complex Types
– Field - a piece of data
– Tuple - an ordered set of fields
– Bag - a collection of tuples
– Relation - a bag
Data Types
• Samples
– Tuple is a row in database - ( 0002576169, Tome, 20, 4.0)
• Bag
– a table or a view in database
– an unordered collection of tuples represented using curly braces
• {(0002576169 , Tome, 20, 4.0),
• (0002576170, Mike, 20, 3.6),
• (0002576171 Lucy, 19, 4.0), …. }
Running a Pig Latin Script
• Local mode
– Local host and local file system is used
– Neither Hadoop nor HDFS is required
– Useful for prototyping and debugging
– Suitable only for small datasets
• MapReduce mode
– Run on a Hadoop cluster and HDFS
Running a Pig Latin Script
• Batch mode - run a script directly
– pig -x local my_pig_script.pig
– pig -x mapreduce my_pig_script.pig
• Interactive mode uses the Pig shell Grunt to run scripts
– grunt> Lines = LOAD '/input/input.txt' AS (line:chararray);
– grunt> Unique = DISTINCT Lines;
– grunt> DUMP Unique;
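Pig can also be embedded in Java through the PigServer API - a minimal sketch of the same DISTINCT example (the class name and paths are illustrative):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class UniqueLines {
    public static void main(String[] args) throws Exception {
        // ExecType.LOCAL mirrors "pig -x local"; use ExecType.MAPREDUCE for a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("Lines = LOAD '/input/input.txt' AS (line:chararray);");
        pig.registerQuery("Unique = DISTINCT Lines;");
        // store() triggers execution, like DUMP/STORE in Grunt
        pig.store("Unique", "/output/unique");
        pig.shutdown();
    }
}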
Running a Pig Latin Script
Operations
• Loading data
– LOAD loads input data
– Lines = LOAD 'input/access.log' AS (line:chararray);
• Projection
– FOREACH … GENERATE … (similar to SELECT)
– takes a set of expressions and applies them to every record
Operations
• Grouping
– collects together records with the same key
• Dump/Store
– DUMP displays results on the screen - the trigger for Pig to start execution
– STORE saves results to the file system
• Aggregation
– AVG, COUNT, MAX, MIN, SUM
Foreach ... Generate
• Iterates over the members of a bag
• Example
– student_data = FOREACH students GENERATE studentid, name
• The result of statement is another bag
• Elements are named as in the input bag
Positional Reference
• Fields referred by positional notation or by name (alias)
– students = LOAD 'student.txt' USING PigStorage() AS
(name:chararray, age:int, gpa:float);
– DUMP students;
– (John,18,4.0F)
– (Mary,19,3.8F)
– (Bill,20,3.9F)
– studentname = FOREACH students GENERATE $0 AS studentname;
Positional Reference
Group
• Groups data in one relation
• GROUP and COGROUP operators are identical but COGROUP creates a
nested set of output tuples
• Both operators work with one or more relations
• For readability GROUP is used in statements involving one relation
• COGROUP is used in statements involving two or more relations
Group
grunt> DUMP A;
(John, Pasta)
(Kate, Burger)
(Joe, Orange)
(Eve, Apple)
Let’s group by the number of characters in the second field:
grunt> B = GROUP A BY SIZE($1);
grunt> DUMP B;
(5,{(John, Pasta),(Eve, Apple)})
(6,{(Kate, Burger),(Joe, Orange)})
Dump & Store
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'apple';
C = FILTER A BY $1 == 'apple';
STORE B INTO 'output/b';
STORE C INTO 'output/c';
Relations B and C are both derived from A
Previously this would create two MapReduce jobs
Pig now creates a single MapReduce job that produces both outputs
Count
• Computes the number of elements in a bag.
• Requires a preceding GROUP ALL statement for global
counts and GROUP BY statement for group counts.
• X = FOREACH B GENERATE COUNT(A);
Pig Operation - Order
• Sorts a relation based on one or more fields
• In Pig, relations are unordered
• If you order relation A to produce relation X, relations A and X still contain
the same elements
• student = ORDER students BY gpa DESC;
Example 1
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, id, time, query);
clean1 = FILTER raw BY id > 20 AND id < 100;
clean2 = FOREACH clean1 GENERATE
user, time,
org.apache.pig.tutorial.sanitize(query) AS query;
user_groups = GROUP clean2 BY (user, query);
user_query_counts = FOREACH user_groups
GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time);
STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(',');
Annotations: read from HDFS; the input format is tab delimited; the schema is declared at run time; rows are filtered on predicates; records are grouped; group aggregation is applied; output is stored as comma-delimited text
Example 2
A = load '$widerow' using PigStorage('\u0001') as (name: chararray, c0: int, c1: int, c2: int);
B = group A by name parallel 10;
C = foreach B generate group, SUM(A.c0) as c0, SUM(A.c1) as c1, AVG(A.c2) as c2;
D = filter C by c0 > 100 and c1 > 100 and c2 > 100;
store D into '$out';
Annotations: '$widerow' and '$out' are script arguments; the input is Ctrl-A ('\u0001') delimited; column types are defined in the schema; PARALLEL 10 requests 10 reducers
Example 3 – Repartition join
register pigperf.jar;
A = load 'page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent, query_term, timestamp, estimated_revenue);
B = foreach A generate user, (double) estimated_revenue;
alpha = load 'users' using PigStorage('\u0001') as (name, phone, address, city, state, zip);
beta = foreach alpha generate name, city;
C = join beta by name, B by user parallel 40;
D = group C by $0;
E = foreach D generate group, SUM(C.estimated_revenue);
store E into 'L3out';
Annotations: register UDFs and custom input formats; the second input is Ctrl-A delimited; the two datasets are joined using 40 reducers; the group follows the join; columns are referred to by position
Example 3 – Replicated Join
register pigperf.jar;
A = load 'page_views' using
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent, query_term, timestamp, estimated_revenue);
Big = foreach A generate user, (double) estimated_revenue;
alpha = load 'users' using PigStorage('\u0001') as (name, phone, address, city, state,
zip);
small = foreach alpha generate name, city;
C = join Big by user, small by name using 'replicated';
store C into 'out';
Annotations: a replicated join is an optimization for joining a big dataset with a small one; the small dataset is listed second
Example 5: Multiple Outputs
Annotations: SPLIT divides records into sets; DUMP displays the data; the multiple outputs are stored separately
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
DUMP X;
(1,2,3)
(4,5,6)
DUMP Y;
(4,5,6)
STORE X INTO 'x_out';
STORE Y INTO 'y_out';
STORE Z INTO 'z_out';
Parallel Independent Jobs
D1 = load 'data1' …
D2 = load 'data2' …
D3 = load 'data3' …
C1 = join D1 by a, D2 by b
C2 = join D1 by c, D3 by d
C1 and C2 are two independent
jobs that can run in parallel
Pig Compilation
Logic Plan
A=LOAD 'file1' AS (x, y, z);
B=LOAD 'file2' AS (t, u, v);
C=FILTER A by y > 0;
D=JOIN C BY x, B BY u;
E=GROUP D BY z;
F=FOREACH E GENERATE
group, COUNT(D);
STORE F INTO 'output';
[Figure: logical plan DAG - Load, Load, Filter, Join, Group, Foreach, Store]
Physical Plan
• 1:1 correspondence with the logical plan
• Except for - Join, Distinct, (Co)Group, Order
• Several optimizations are automatic
Pig Handling
• Schema and type checking
• Translating into efficient physical dataflow
– sequence of one or more MapReduce jobs
• Exploiting data reduction opportunities
– early partial aggregation via a combiner
• Executing the system-level dataflow
– running the MapReduce jobs
• Tracking progress, errors etc
Example Problem
• Given user data in one file,
and website data in
another, find the top 5 most
visited pages by users aged
18-25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
In Pig Latin
• Users = load 'users' as (name, age);
Filtered = filter Users by
age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group,
COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
[Figure: the dataflow - Load Users, Load Pages, Filter by age, Join on name, Group on url, Count clicks, Order by clicks, Take top 5 - is translated into three MapReduce jobs (Job 1, Job 2, Job 3)]
Translation to MapReduce
APACHE FLUME
Apache Flume
• A distributed, reliable and available service for efficiently
collecting, aggregating, and moving large amounts of log data
• One-stop solution for data collection of all formats
• A simple and flexible architecture based on streaming data flows
• A robust and fault tolerant architecture with tuneable reliability
mechanisms and many failover and recovery mechanisms
Apache Flume
• Uses a simple extensible data model that allows for online
analytic application
• Complex flows
– Flume allows a user to build multi-hop flows where events travel
through multiple agents before reaching the final destination
– It also allows fan-in and fan-out flows, contextual routing and backup
routes (fail-over) for failed hops
Apache Flume
Parallelism
• When running in MapReduce mode it’s important that the degree of
parallelism matches the size of the dataset
• By default, Pig uses one reducer per 1GB of input, up to a maximum of
999
• User can override these parameters by setting
pig.exec.reducers.bytes.per.reducer (the default is 1000000000 bytes)
and pig.exec.reducers.max (default 999)
Parallelism
• To explicitly set the number of reducers for each job, use a PARALLEL
clause for operators that run in the reduce phase
• These include all the grouping and joining operators GROUP, COGROUP,
JOIN, CROSS as well as DISTINCT and ORDER
• Following line sets the number of reducers to 30 for the GROUP
– grouped_records = GROUP records BY year PARALLEL 30;
• Alternatively, set the default_parallel option for all subsequent jobs
– grunt> set default_parallel 30
High Level Overview
[Figure: a Flume agent collects from sources such as local files, HDFS, stdin/stdout, Twitter, IRC and IMAP, and delivers the data into HDFS]
Data Flow Model
• A Flume event is defined as a unit of data flow having a byte payload and an
optional set of string attributes
• A Flume agent is a (JVM) process that hosts the components through which
events flow from an external source to the next destination (hop)
• A Flume source consumes events delivered to it by an external source like a web
server
• The external source sends events to Flume in a format that is recognized by the
target Flume source
Data Flow Model
• For example, an Avro Flume source can be used to receive Avro events
from Avro clients or other Flume agents in the flow that send events from
an Avro sink
• A similar flow can be defined using a Thrift Flume Source to receive
events from a Thrift Sink or a Flume Thrift RPC Client or Thrift clients
written in any language generated from the Flume thrift protocol
• When a Flume source receives an event, it stores it into one or more
channels
Data Flow Model
• The channel is a passive store that keeps the event until it’s consumed by a Flume
sink
• The file channel is one example – it is backed by the local file system
• The sink removes the event from the channel and puts it into an external
repository like HDFS (via Flume HDFS sink) or forwards it to the Flume source of
the next Flume agent (next hop) in the flow
• The source and sink within the given agent run asynchronously with the events
staged in the channel
HDFS Sink
• This sink writes events into the Hadoop Distributed File System (HDFS)
• Supports creating text and sequence files along with compression
• The files can be rolled (close current file and create a new one)
periodically based on the elapsed time or size of data or number of events
• Buckets/partitions data by attributes like timestamp or machine where
the event originated
HDFS Sink
• The HDFS directory path may contain formatting escape sequences that will
be replaced by the HDFS sink to generate a directory/file name to store
the events
• Hadoop installation is required so that Flume can use the Hadoop jars to
communicate with the HDFS cluster
• A version of Hadoop that supports the sync() call is required.
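An illustrative agent definition tying the pieces together - a source feeding a memory channel drained by the HDFS sink. The agent/component names, paths and the choice of an exec source are assumptions for the sketch, not part of the original deck:

# One agent (a1) with one source, one memory channel and one HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: exec source tailing an application log (illustrative)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Channel: fast but non-durable; a file channel would survive an agent crash
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events into HDFS, bucketed by date via escape sequences
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.rollInterval = 300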
Reliability & Recoverability
• The events are staged in a channel on each agent
• The events are then delivered to the next agent or terminal repository
(like HDFS) in the flow
• The events are removed from a channel only after they are stored in the
channel of next agent or in the terminal repository
• This is how the single-hop message delivery semantics in Flume provide
end-to-end reliability of the flow
Reliability & Recoverability
• Flume uses a transactional approach to guarantee the reliable delivery of the events
• The sources and sinks encapsulate the storage and retrieval, respectively, of the
events in a transaction provided by the channel
• This ensures that the set of events are reliably passed from point to point in the flow
• In the case of a multi-hop flow, the sink from the previous hop and the source from the next
hop both have their transactions running to ensure that the data is safely stored in the
channel of the next hop.
Reliability & Recoverability
• The events are staged in the channel, which manages recovery from
failure
• Flume supports a durable file channel which is backed by the local file
system
• There’s also a memory channel which simply stores the events in an in-
memory queue, which is faster but any events still left in the memory
channel when an agent process dies can’t be recovered
Multi-Agent flow
• For data to flow across multiple agents or hops, the sink of the previous
agent and source of the current hop need to be Avro type with the sink
pointing to the hostname (or IP address) and port of the source
Consolidation
• A very common scenario in log collection is a large number of log
producing clients sending data to a few consumer agents that are
attached to the storage subsystem
• For example, logs collected from hundreds of web servers can be sent to a dozen
agents that write to the HDFS cluster
• This can be achieved in Flume by configuring a number of first-tier agents
with an Avro sink, all pointing to an Avro source of a single agent (Thrift
sources / sinks / clients can be used in such a scenario)
• This source on the second tier agent consolidates the received events into
a single channel which is consumed by a sink to its final destination
Consolidation
Multiplexing Flow
• Flume supports multiplexing the event flow to one or more destinations
• This is achieved by defining a flow multiplexer that can replicate or selectively
route an event to one or more channels
• For the multiplexing case, an event is delivered to a subset of available channels
when an event’s attribute matches a preconfigured value
• For example, if an event attribute called “txnType” is set to “customer”, then it
should go to channel1 and channel3, if it’s “vendor” then it should go to channel2,
otherwise channel3
• The mapping can be set in the agent’s configuration file
Multiplexing Flow
APACHE SQOOP
Apache Sqoop
• An open-source tool to extract data from a relational database into HDFS or
HBase
• Available for MySQL, PostgreSQL, Oracle, SQL Server and DB2
• A single client program that creates one or more MapReduce jobs to perform
its tasks
• By default 4 map tasks are used in parallel
• Sqoop does not have any server processes
• If we assume a table with 1 million records and four mappers, then each will
process 250,000 records
Apache Sqoop
• With its knowledge of the primary key column, Sqoop can create four SQL statements to
retrieve the data, each using a different range of the primary key column as its predicate
• In the simplest case, this could be as straightforward as adding something like WHERE id
BETWEEN 1 AND 250000 to the first statement and using different id ranges for the others
• In addition to writing the contents of the database table to HDFS, Sqoop also provides a
generated Java source file (widgets.java) written to the current local directory
• Sqoop uses the generated code to handle the deserialization of table-specific data from the
database source before writing it to HDFS
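For illustration, the split column and degree of parallelism can be set explicitly on the command line (connection details follow the earlier examples in this deck; the split column and target directory are assumptions):

$ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest \
    --username hadoopuser -P \
    --table widgets --split-by id -m 4 \
    --target-dir /user/hadoopuser/widgets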
Apache Sqoop
Architecture
Commands
• codegen - generate code to interact with database records
• create-hive-table - import a table definition into Hive
• eval - evaluate a SQL statement and display the results
• export - export an HDFS directory to a database table
• help - list available commands
• import - import a table from a database to HDFS
• import-all-tables - import tables from a database to HDFS
• job - work with saved jobs
• list-databases - list available databases on a server
• list-tables - list available tables in a database
• merge - merge results of incremental imports
• metastore - run a standalone Sqoop metastore
• version - display version information
Importing data into Hive using Sqoop
• Sqoop has significant integration with Hive, allowing it to import data
from a relational source into either new or existing Hive tables
$ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest
--username hadoopuser -P
--table employees --hive-import --hive-table employees
Export
An export uses HDFS as the source of data and a remote database as the destination
% sqoop export --connect jdbc:mysql://localhost/hadoopguide -m 1 \
> --table sales_by_zip --export-dir /user/hive/warehouse/zip_profits \
> --input-fields-terminated-by '\0001'
...
10/07/02 16:16:50 INFO mapreduce.ExportJobBase: Transferred 41 bytes in 10.8947
seconds (3.7633 bytes/sec)
10/07/02 16:16:50 INFO mapreduce.ExportJobBase: Exported 3 records.
Export
APACHE ZOOKEEPER
Apache Zookeeper
• A set of tools to build distributed applications that can safely handle partial failures
• A rich set of building blocks to build a large class of coordination data structures and protocols like
distributed queues, distributed locks, and leader election among a group of peers
• Runs on a collection of machines for high availability
• Avoids single points of failure for reliability
• Facilitates loosely coupled interactions, so that participants do not need to know about one
another
• An open source, shared repository of implementations and recipes of common coordination
patterns
• Built-in services like naming, configuration management, locks and synchronization, and group
services provide high-performance coordination for distributed applications
Apache Zookeeper
• Written in Java
• Strongly consistent
• Ensemble of Servers
• In-memory data
• Datasets must fit in memory
• Shared hierarchical namespace
• Access Control list for each node
• Similar to a file system
Apache Zookeeper
Zookeeper Service
• All servers store the copy of data in memory
• A leader is elected at start up
• Followers respond to clients
• All updates go through leaders
• Responses are sent when a majority of servers have persisted changes
Zookeeper Service
High Availability
Znodes
• A unified concept of a node called a znode
• Acts both as a container of data (like a file) and a container of other
znodes (like a directory)
• Form a hierarchical namespace
• Two types - ephemeral or persistent. Set at creation time and not
changed later
• To build a membership list, create a parent znode with the name of the
group and child znodes with the name of the group members (servers)
Znodes
• Referenced by paths, which are represented as slash-delimited Unicode
character strings, like file system paths in Unix
• Paths must be absolute, so they must begin with a slash character
• Paths are canonical hence each path has a single representation
API
• create - Creates a znode
• delete - Deletes a znode (must not have any children)
• exists - Tests if a znode exists and retrieves its metadata
• getACL, setACL - Gets/sets the ACL for a znode
• getChildren - Gets a list of the children of a znode
• getData, setData - Gets/sets data associated with a znode
• sync - Synchronizes a client’s view of a znode with ZooKeeper
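A minimal sketch of the API in use, building the group-membership pattern described above (ensemble addresses, paths and data are illustrative; a production client would also wait for the connection event before issuing calls):

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class GroupMember {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble (host list, paths and data are illustrative)
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000,
                event -> { /* connection and watch events arrive here */ });

        // Persistent parent znode naming the group (created once)
        if (zk.exists("/servers", false) == null) {
            zk.create("/servers", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Ephemeral child: removed automatically when this client's session ends
        zk.create("/servers/host1", "10.0.0.1:8080".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Any client can list the live members of the group
        List<String> members = zk.getChildren("/servers", false);
        System.out.println("group members: " + members);
    }
}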
Ephemeral Nodes
• Deleted when the creating client’s session ends
• May not have children, not even ephemeral ones.
• Even though tied to a client session, they are visible to all
clients subject to their ACL policy
• Ideal for building applications that need to know when
certain distributed resources are available
Ephemeral Nodes
• Example - a group membership service that allows any
process to discover the members of the group at any
particular time
• A persistent znode is not tied to the client’s session and is
deleted only when explicitly deleted by any client
Sequence Nodes
• A sequential znode has a sequence number
• A znode created with sequential flag set has the value of a monotonically
increasing 10 digit counter, maintained by the parent znode, appended to
its name
• If a client asks to create a sequential znode with the name /a/b-, for
example, then the znode created may have a name like /a/b-3
Sequence Nodes
• Another new sequential znode created with the name /a/b- will have a unique
name with a larger value of the counter - for example, /a/b-5
• Sequence numbers can be used to impose a global ordering on events in a
distributed system, and may be used by the client to infer the ordering
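A small sketch of how a sequential znode is created (it assumes an open ZooKeeper handle like the one in the earlier sketch and an existing /election parent znode; names are illustrative):

// Each candidate creates an ephemeral, sequential znode under /election
static String enterElection(ZooKeeper zk) throws Exception {
    // EPHEMERAL_SEQUENTIAL appends the parent's counter to the name and ties the
    // znode to this session, so a crashed candidate disappears automatically
    String path = zk.create("/election/candidate-", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    return path;   // e.g. /election/candidate-0000000007; the lowest number acts as leader
}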
Watches
• Allow clients to get notifications when a znode changes (data or children)
• Works as a one-shot callback mechanism, triggered when a connection or znode
state changes
• A watch set on an exists operation will be triggered when the znode being
watched is created, deleted, or has its data updated
• A watch set on a getData operation will be triggered when the znode
being watched is deleted or has its data updated.
Watches
• A watch set on a getChildren operation will be triggered when a child of
the znode being watched is created or deleted, or when the znode itself is
deleted
• Triggered only once
• To receive multiple notifications, a client needs to reregister the watch
• If a client wishes to receive further notifications for the znode’s existence
(to be notified when it is deleted, for example), it needs to call the exists
operation again to set a new watch
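A sketch of this one-shot behaviour - the watcher re-registers itself every time it fires (the znode path and class name are illustrative):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ConfigWatcher implements Watcher {
    private final ZooKeeper zk;

    public ConfigWatcher(ZooKeeper zk) { this.zk = zk; }

    public void watchConfig() throws KeeperException, InterruptedException {
        // Passing a Watcher to getData registers a one-shot notification
        byte[] data = zk.getData("/config", this, null);
        System.out.println("config: " + new String(data));
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                watchConfig();   // re-register, since a watch fires only once
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}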
High Availability Mechanism
• For resilience, ZooKeeper runs in replicated mode on a
cluster of machines called an ensemble
• Achieves high-availability through replication, and can
provide a service as long as a majority of the machines in the
ensemble are up
• ZooKeeper uses a protocol called Zab that runs in two phases
and is repeated indefinitely
High Availability Mechanism
• Phase 1: Leader election
– The machines in an ensemble go through a process of electing a
distinguished member, called the leader
– The other machines are termed followers.
– This phase is finished once a majority (or quorum) of followers have
synchronized their state with the leader
High Availability Mechanism
• Phase 2: Atomic broadcast
– All write requests are forwarded to the leader, which broadcasts the update
to the followers
– When a majority have persisted the change, the leader commits the update,
and the client gets a response saying the update succeeded
– The protocol for achieving consensus is designed to be atomic, so a change
either succeeds or fails. It resembles a two-phase commit
Thank You
Check Out My LinkedIn Profile at
https://in.linkedin.com/in/girishkhanzode
Hadoop EcosystemHadoop Ecosystem
Hadoop EcosystemLior Sidi
 
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukCloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukAndrii Vozniuk
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemVaibhav Jain
 
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptxLecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptxNIKHILGR3
 

Similar to Hadoop (20)

HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Distributed Data processing in a Cloud
Distributed Data processing in a CloudDistributed Data processing in a Cloud
Distributed Data processing in a Cloud
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Lecture17.ppt
Lecture17.pptLecture17.ppt
Lecture17.ppt
 
Lecture17 (1).ppt
Lecture17 (1).pptLecture17 (1).ppt
Lecture17 (1).ppt
 
Lecture17.ppt
Lecture17.pptLecture17.ppt
Lecture17.ppt
 
big data ppt.ppt
big data ppt.pptbig data ppt.ppt
big data ppt.ppt
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukCloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptxLecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
 

More from Girish Khanzode (9)

Data Visulalization
Data VisulalizationData Visulalization
Data Visulalization
 
IR
IRIR
IR
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
NLP
NLPNLP
NLP
 
NLTK
NLTKNLTK
NLTK
 
NoSql
NoSqlNoSql
NoSql
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Language R
Language RLanguage R
Language R
 
Python Scipy Numpy
Python Scipy NumpyPython Scipy Numpy
Python Scipy Numpy
 

Recently uploaded

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Recently uploaded (20)

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Hadoop

  • 12. HDFS - Key Points • Files are broken into large blocks – Typically 128 MB block size – Blocks are replicated on multiple DataNodes for reliability • Understands rack locality – One replica on the local node, another on a remote rack, a third on the local rack; additional replicas are randomly placed • Data placement is exposed so that computation can be migrated to the data • Client talks to both the NameNode and DataNodes • Data is not sent through the NameNode; clients read data directly from DataNodes • Throughput of the file system scales nearly linearly with the number of nodes.
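A minimal sketch of how a client reads a file through this design, using the standard Hadoop FileSystem Java API (the path /data/sample.txt is a hypothetical example): the client asks the NameNode only for metadata and block locations, then streams the bytes directly from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);              // handle to the cluster file system
    try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);  // bytes are read directly from DataNodes
    }
  }
}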
  • 13. NameNode • DFS Master – Manages the file system namespace – Controls read/write access to files – Manages block replication – Checkpoints namespace and journals namespace changes for reliability • Metadata of Name node in Memory – The entire metadata is in main memory – No demand paging of FS metadata • Types of Metadata: – List of files, file and chunk namespaces; list of blocks, location of replicas; file attributes etc.
  • 14. DataNodes • Serve read/write requests from clients • Perform replication tasks upon instruction by NameNode • Stores data in the local file system • Stores metadata of a block (e.g. CRC) • Serves data and metadata to Clients • Periodically sends a report of all existing blocks to the NameNode • Periodically sends heartbeat to NameNode (detect node failures) • Facilitates Pipelining of Data (to other specified DataNodes)
  • 16. HDFS High Availability • Option of running two redundant NameNodes in the same cluster • Active/Passive configuration with a hot standby • Fast fail-over to a new NameNode if a machine crashes • Graceful administrator-initiated fail-over for planned maintenance
  • 18. NameNode Failure • Prior to Hadoop 2.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster • Secondary Name Node – Not a standby for Name Node – Connects to Name Node every hour – Performs housekeeping, backup of Name Node metadata – Saved metadata can rebuild a failed Name Node
  • 19. DataNode Failure • Each DataNode periodically sends a Heartbeat message to the NameNode • If the NameNode does not receive a heartbeat from a particular DataNode for 10 minutes, then it considers that data node to be dead/out of service. • NameNode initiates replication of blocks hosted on that data node to some other data node
  • 21. MapReduce Framework • Programming model developed at Google • Sort/merge based distributed computing • Automatic parallel execution & distribution • Fault tolerant • Functional style programming for parallelism across a large cluster of nodes • Works like a parallel Unix pipeline: – cat input | grep | sort | uniq -c | cat > output – Input | Map | Shuffle & Sort | Reduce | Output
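Hadoop Streaming makes the pipeline analogy above literal: any executables that read stdin and write stdout can act as the mapper and reducer. A hedged sketch (the streaming jar location varies by distribution, and the input/output paths are illustrative):

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input /data/logs \
  -output /data/logs-counts \
  -mapper /bin/cat \
  -reducer /usr/bin/wc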
  • 22. MapReduce Framework • Underlying system takes care of – partitioning of the input data – scheduling the program’s execution across several machines – handling machine failures – managing inter-machine communication • Provides inter-node communication – Failure recovery, consistency etc – Load balancing, scalability etc • Suitable for batch processing applications – Log processing – Web index building
  • 24. What is MapReduce Used For? • At Google: – Index building for Google Search – Article clustering for Google News – Statistical machine translation • At Yahoo!: – Index building for Yahoo! Search – Spam detection for Yahoo! Mail • At Facebook: – Data mining – Ad optimization – Spam detection
  • 25. MapReduce Components • JobTracker – Map/Reduce Master – Accepts MR jobs submitted by users – Assigns Map and Reduce tasks to TaskTrackers – Monitors task and TaskTracker statuses, Re-executes tasks upon failure • TaskTrackers – Map/Reduce Slaves – Run Map and Reduce tasks upon instruction from the JobTracker – Manage storage and transmission of intermediate output
  • 27. Distributed Execution – diagram: the user program forks a master and several workers; the master assigns map tasks and reduce tasks; map workers read input splits and write intermediate data to local disk; reduce workers remote-read and sort that data and write the output files
  • 28. Working of MapReduce • The run time partitions the input and provides it to different Map instances • Map (k1, v1) -> (k2, v2) • The run time collects the (k2, v2) pairs and distributes them to several Reduce functions so that each Reduce function gets the pairs with the same k2 • Each Reduce produces a single (or zero) file output • Map and Reduce are user written functions
  • 29. Input and Output • MapReduce operates exclusively on <key, value> pairs • Job Input: <key, value> pairs • Job Output: <key, value> pairs • Key and value can be different types, but must be serializable by the framework – map: <k1, v1> → <k2, v2>; reduce: <k2, v2> → <k3, v3>
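In Hadoop this "serializable by the framework" requirement means the key and value types implement the Writable interface (keys additionally implement WritableComparable). A minimal sketch of a custom value type; the class name CountSumWritable and its two fields are hypothetical:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Example value type carrying a count and a running sum between map and reduce
public class CountSumWritable implements Writable {
  private long count;
  private double sum;

  public void set(long count, double sum) { this.count = count; this.sum = sum; }

  @Override
  public void write(DataOutput out) throws IOException {    // how the framework serializes the value
    out.writeLong(count);
    out.writeDouble(sum);
  }

  @Override
  public void readFields(DataInput in) throws IOException { // how the framework deserializes it
    count = in.readLong();
    sum = in.readDouble();
  }

  @Override
  public String toString() { return count + "\t" + sum; }   // used by the default text output format
}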
  • 30. Example - Counting Words • Given a large collection of documents, output the frequency for each unique word • After putting this data into HDFS, Hadoop automatically splits into blocks and replicates each block
  • 31. Input Reader • The input reader reads a block and divides it into splits • Each split is sent to a map function – a line is one input of a map function • The key could be some internal number (filename-blockid-lineid); the value is the content of the textual line – diagram: Block 1 contains the lines "Apple Orange Mongo" and "Orange Grapes Plum", Block 2 contains "Apple Plum Mongo" and "Apple Apple Plum"; the input reader hands each line to a mapper
  • 32. Mapper - Map Function • Mapper takes the output generated by the input reader and outputs a list of intermediate <key, value> pairs – diagram: mappers m1–m4 turn the four input lines into pairs such as (Apple, 1), (Orange, 1), (Mongo, 1), (Grapes, 1) and (Plum, 1), one pair per word occurrence
  • 33. Reducer - Reduce Function • Reducer takes the output generated by the Mapper, aggregates the value for each key, and outputs the final result • There is shuffle/sort before reducing.
  • 34. Reducer - Reduce Function – diagram: shuffle/sort groups the intermediate pairs by key, and reducers r1–r5 emit the totals (Apple, 4), (Orange, 2), (Grapes, 1), (Mongo, 2) and (Plum, 3)
  • 35. Reducer - Reduce Function • The same key MUST go to the same reducer • Different keys CAN go to the same reducer – diagram: every (Orange, 1) pair goes to reducer r2, which emits (Orange, 2); r2 may also receive the (Grapes, 1) pair and then emits both (Orange, 2) and (Grapes, 1)
  • 36. Combiner • When the map operation outputs its pairs they are already available in memory • For efficiency, it sometimes makes sense to take advantage of this by supplying an (optional) combiner class to perform a reduce-type function on the map side • If a combiner is used, the map key-value pairs are not immediately written to the output • Instead they are collected in lists, one list per key – diagram: the map output (Apple, 1), (Apple, 1), (Plum, 1) is combined into (Apple, 2), (Plum, 1) before the shuffle
  • 37. Partitioner - Partition function • When a mapper emits a key-value pair, it has to be sent to one of the reducers - which one? • The mechanism sending specific key-value pairs to specific reducers is called partitioning (the key-value pair space is partitioned among the reducers) • In Hadoop, the default partitioner is HashPartitioner, which hashes a record’s key to determine which partition (and thus which reducer) the record belongs in • The number of partitions is equal to the number of reduce tasks for the job
  • 38. Importance of Partition • It has a direct impact on the overall performance of the job • A poorly designed partitioning function will not evenly distribute the load over the reducers, potentially losing much of the benefit of the map/reduce distributed infrastructure • It may sometimes be necessary to control how key/value pairs are partitioned over the reducers
  • 39. Importance of Partition • Example: a job’s input is a huge set of tokens with their numbers of occurrences, and you want to sort them by number of occurrences – diagram: contrasts the reducer load without any customized partitioner against the load with a customized partitioner (see the sketch below)
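A minimal sketch of a customized partitioner under the Hadoop Java API; the class name and the routing rule (spreading word keys over reducers by first letter so each reducer gets a contiguous alphabetical range) are made up for illustration:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    char first = Character.toLowerCase(key.toString().charAt(0));
    if (first < 'a' || first > 'z') {
      return 0;                                   // non-alphabetic keys all go to the first reducer
    }
    return (first - 'a') * numPartitions / 26;    // map a..z onto partitions 0..numPartitions-1
  }
}

It would be wired into a job with job.setPartitionerClass(FirstLetterPartitioner.class); without it, Hadoop falls back to HashPartitioner, i.e. hash(key) mod R.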
  • 40. Example - Word Count • map(String key, String value): // key: document name; value: document contents; map (k1,v1) -> list(k2,v2) for each word w in value: EmitIntermediate(w, "1"); (If input string is (“abc def ghi abc mno pqr”), Map produces {<“abc”,1”>, <“def”, 1>, <“ghi”, 1>, <“abc”,1>, <“mno”,1>,<“pqr”,1>} • reduce(String key, Iterator values): // key: a word; values: a list of counts; reduce (k2,list(v2)) -> list(v2) int result = 0; for each v in values: result += ParseInt(v); Emit(AsString(result)); (Example: reduce(“abc”, <1,1>) -> 2)
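For reference, a complete runnable version of this word count under the Hadoop Java MapReduce API; this is essentially the canonical example shipped with Hadoop. Note that the reducer class is also registered as the combiner, tying back to slide 36:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // reduce(k2, list(v2)) -> (k2, sum): add up the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional map-side pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}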
  • 42. Simplified MapReduce – diagram: on two machines, a local map turns the input pairs <k1, v1> … <k6, v6> into new-key pairs <nk1, nv1> … <nk1, nv6>; a local sort groups them by new key; a local reduce then emits the per-key counts <nk2, 3>, <nk1, 2>, <nk3, 1>
  • 43. JobTracker Failure • If the master task dies, a new copy can be started from the last check-pointed state. However, in most cases, the user restarts the job • After restarting JobTracker all the jobs running at the time of the failure should be resubmitted
  • 44. Handling TaskTracker Failure • The JobTracker pings every worker periodically • If no response is received from a worker in a certain amount of time, the master marks the worker as failed • Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. • Any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. • Task tracker will stop sending the heartbeat to the Job Tracker
  • 45. Handling TaskTracker Failure • JobTracker notices this failure • It hasn’t received a heartbeat for 10 minutes • Can be configured via the mapred.tasktracker.expiry.interval property • JobTracker removes this TaskTracker from the task pool • Map tasks are rerun even if they ran to completion • Their intermediate output resides on the failed TaskTracker’s local file system, which is not accessible to the reduce tasks.
  • 47. Data flow • Input, final output are stored on a distributed file system – Scheduler tries to schedule map tasks “close” to physical storage location of input data • Intermediate results are stored on local FS of map and reduce workers • Output is often input to another map reduce task
  • 48. Coordination • Master data structures – Task status: (idle, in-progress, completed) – Idle tasks get scheduled as workers become available – When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer – Master pushes this info to reducers • Master pings workers periodically to detect failures
  • 49. Failures • Map worker failure – Map tasks completed or in-progress at worker are reset to idle – Reduce workers are notified when task is rescheduled on another worker • Reduce worker failure – Only in-progress tasks are reset to idle • Master failure – MapReduce task is aborted and client is notified
  • 50. How many Map and Reduce jobs? • M - map tasks, R - reduce tasks • Rule of thumb – Make M and R much larger than the number of nodes in cluster – One DFS chunk per map is common – Improves dynamic load balancing and speeds recovery from worker failure • Usually R is smaller than M because output is spread across R files
  • 51. Mapping Workers to Processors • MapReduce master takes the location information of the input files and schedules a map task on a machine that contains a replica of the corresponding input data • If failed, it attempts to schedule a map task near a replica of that task's input data • When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth
  • 52. Combiner Function • User can specify a Combiner function that does partial merging of the intermediate local disk data before it is sent over the network. • The Combiner function is executed on each machine that performs a map task • Typically the same code is used to implement both the combiner and the reduce functions • Often a map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k – popular words in Word Count • Can save network time by pre-aggregating at mapper – combine(k1, list(v1)) -> v2 – Usually same as reduce function • Works only if reduce function is commutative and associative
  • 53. Partition Function • The users of MapReduce specify the number of reduce tasks/output files that they desire (R) • Data gets partitioned across these tasks using a partitioning function on the intermediate key • A default partitioning function is provided that uses hashing (hash(key) mod R) • In some cases, it may be useful to partition data by some other function of the key. The user of the MapReduce library can provide a special partitioning function.
  • 54. Task Granularity • The map phase has M pieces and the reduce phase has R pieces • M and R should be much larger than the number of worker machines • Having each worker perform many different tasks improves dynamic load balancing and also speeds up recovery when a worker fails • Larger the M and R, more the decisions the master must make • R is often constrained by users because the output of each reduce task ends up in a separate output file • Typically - at Google, M = 200,000 and R = 5,000, using 2,000 worker machines
  • 57. Execution Summary • Distributed Processing – Partition input key/value pairs into chunks, run map() tasks in parallel – After all map()s are complete, consolidate all emitted values for each unique emitted key – Now partition space of output map keys, and run reduce() in parallel • If map() or reduce() fails -> re-execute
  • 58. MapReduce – Data Flow • Input reader – divides input into appropriate size splits which get assigned to a Map function. • Map function – maps file data/split to smaller, intermediate <key, value> pairs. • Partition function – finds the correct reducer: given the key and number of reducers, returns the desired reducer node. (optional) • Compare function – input from the Map intermediate output is sorted according to the compare function. (optional) • Reduce function – takes intermediate values and reduces to a smaller solution handed back to the framework. • Output writer – writes file output.
  • 59. Execution Overview • The MapReduce library in user program splits input files into M pieces of typically 16 MB to 64 MB/piece • It then starts up many copies of the program on a cluster of machines • One of the copies of the program is the master • The rest are workers that are assigned work by the master • There are M map tasks and R reduce tasks to assign • The master picks idle workers and assigns each one a map task or a reduce task • A worker who is assigned a map task reads the contents of the assigned input split • It parses key/value pairs out of the input data and passes each pair to the user-defined Map function • The intermediate key/value pairs produced by the Map function are buffered in memory • The locations of these buffered pairs on the local disk are passed back to the master, who forwards these locations to the reduce workers
  • 60. Execution Overview • When a reduce worker is notified by the master about these locations, it uses RPC remote procedure calls to read the buffered data from the local disks of the map workers • When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together • The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function • The output of the Reduce function is appended to a final output file for this reduce partition • When all map tasks and reduce tasks have been completed, the master wakes up the user program - the MapReduce call in the user program returns back to the user code • The output of the mapreduce execution is available in the R output files (one per reduce task)
  • 61. MapReduce Advantages • Distribution is completely transparent – Not a single line of distributed programming (ease, correctness) • Automatic fault-tolerance – Determinism enables running failed tasks somewhere else again – Saved intermediate data enables just re-running failed reducers • Automatic scaling – As operations are side-effect free, they can be distributed to any number of machines dynamically • Automatic load-balancing – Move tasks and speculatively execute duplicate copies of slow tasks (stragglers)
  • 63. Need for High-Level Languages • Hadoop is great for large-data processing – But writing Java programs for everything is verbose and slow – Not everyone wants to (or can) write Java code • Solution: develop higher-level data processing languages – Hive - HQL is like SQL – Pig - Pig Latin is a bit like Perl • Hive - data warehousing application in Hadoop – Query language is HQL, variant of SQL – Tables stored on HDFS as flat files
  • 64. Need for High-Level Languages • Pig - large-scale data processing system – Scripts are written in Pig Latin, a dataflow language – Developed by Yahoo, now open source • Common idea – Provide higher-level language to facilitate large-data processing – Higher-level language compiles down to Hadoop jobs
  • 65. Hive - Background • Started at Facebook • Data was collected by nightly cron jobs into Oracle DB • ETL via hand-coded python • Grew from 10s of GBs (2006) to 1 TB/day new data (2007), now 10x that • A data warehouse system to facilitate easy data summarization, ad-hoc queries and the analysis of large datasets stored in Hadoop compatible file systems • Supports Hive Query Language (HQL) statements similar to SQL statements Source: cc-licensed slide by Cloudera
  • 66. Hive • HiveQL is a subset of SQL covering the most common statements • HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster • JDBC/ODBC support • Follows schema-on-read design – very fast initial load • Agile data types: Array, Map, Struct, and JSON objects • User Defined Functions and Aggregates • Regular Expression support • Partitions and Buckets (for performance optimization)
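A minimal sketch of the JDBC route into Hive, assuming a HiveServer2 instance listening on localhost:10000 and using the page_view table that appears in the Hive QL slides below; the driver class and URL scheme are the standard HiveServer2 ones:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // HiveServer2 compiles each HQL statement into MapReduce jobs that run on the cluster
    try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT pageid, COUNT(DISTINCT userid) FROM page_view GROUP BY pageid")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}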
  • 67. Hive Components • Shell: allows interactive queries • Driver: session handles, fetch, execute • Compiler: parse, plan, optimize • Execution engine: DAG of stages (MR, HDFS, metadata) • Metastore: schema, location in HDFS, SerDe Source: cc-licensed slide by Cloudera
  • 70. Data Model • Basic column types (int, float, boolean) • Complex types: List / Map ( associate array), Struct CREATE TABLE complex ( col1 ARRAY<INT>, col2 MAP<STRING, INT>, col3 STRUCT<a:STRING, b:INT, c:DOUBLE> ); • Built-in functions – mathematical, statistical, string, date, conditional functions, aggregate functions and functions for working with XML and JSON
  • 71. Data Model • Tables – Typed columns (int, float, string, boolean) – list: map (for JSON-like data) • Partitions – For example, range-partition tables by date • Buckets – Hash partitions within ranges – useful for sampling, join optimization Source: cc-licensed slide by Cloudera
  • 72. Metastore • Database: namespace containing a set of tables • Holds table definitions (column types, physical layout) • Holds partitioning information • Can be stored in Derby, MySQL, and many other relational databases Source: cc-licensed slide by Cloudera
  • 73. Physical Layout • Warehouse directory in HDFS – /user/hive/warehouse • Tables stored in subdirectories of warehouse – Partitions form subdirectories of tables • Actual data stored in flat files – Control char-delimited text or SequenceFiles – With custom SerDe, can use arbitrary format Source: cc-licensed slide by Cloudera
  • 74. Metadata • Data organized into tables • Metadata like table schemas stored in the database metastore • The metastore is the central repository of Hive metadata • Metastore runs in the same process as the Hive service • Loading data into a Hive table results in copying the data file into its working directory and input data is not processed into rows • HiveQL queries use metadata for query execution
  • 75. Tables • Logically made up of the data being stored and the associated metadata describing the layout of the data in the table • The data can reside in an HDFS-like file system or S3 • Hive stores the metadata in a relational database and not in HDFS • When a table is created, Hive moves the data into its warehouse directory • External table – Hive refers to data outside the warehouse directory
  • 76. Partitioning • Hive organizes tables into partitions by dividing a table into coarse-grained parts based on the value of a partition column, such as date • Using partitions makes queries faster on slices of the data • Log files with each record containing a timestamp - If partitioned by date, records for the same date would be stored in the same partition • Queries restricted to a particular date or set of dates are more efficient since only required files are scanned • Partitioning on multiple dimensions allowed. • Defined at table creation time • Separate subdirectory for each partition
  • 77. Bucketing • Partitions further organized in buckets for more efficient queries • Clustered by clause is used to create buckets using the specified column • Data within a bucket can be additionally sorted by one or more columns
  • 78. UDF • A UDF operates on a single row and produces a single row as its output; most functions, such as mathematical and string functions, are of this type • A UDAF (user-defined aggregate function) works on multiple input rows and creates a single output row, e.g. COUNT and MAX • A UDTF (user-defined table-generating function) operates on a single row and produces multiple rows (a table) as output
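A minimal sketch of a simple Hive UDF in Java, written against the classic org.apache.hadoop.hive.ql.exec.UDF base class; the class name Lower is a hypothetical example. After packaging it in a jar it would be registered with ADD JAR and CREATE TEMPORARY FUNCTION:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Lower-cases a string column; Hive finds the evaluate() method by reflection
public class Lower extends UDF {
  public Text evaluate(Text input) {
    if (input == null) {
      return null;
    }
    return new Text(input.toString().toLowerCase());
  }
}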
  • 79. INSERT INTO TABLE pv_users SELECT pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid); X = page_view user pv_users Hive QL – Join
  • 81. INSERT INTO TABLE pageid_age_sum SELECT pageid, age, count(1) FROM pv_users GROUP BY pageid, age; pv_users pageid_age_sum Hive QL – Group By
  • 83. SELECT pageid, COUNT(DISTINCT userid) FROM page_view GROUP BY pageid page_view result Hive QL – Group By with Distinct
  • 84. page_view Shuffle and Sort Reduce Shuffle key is a prefix of the sort key. Hive QL – Group By with Distinct in Map Reduce
  • 86. Hive Benefits • An easy way to process large-scale data • Supports SQL-based queries • Provides user-defined interfaces to extend programmability • Efficient execution plans for performance • Interoperability with other database tools
  • 88. Apache Pig • Framework to analyze large un-structured and semi-structured data on top of Hadoop • Consists of a high-level language for expressing data analysis programs, coupled with infrastructure • Compiles down to MapReduce jobs • Infrastructure layer consists of – a compiler to create sequences of Map-Reduce programs – language layer consists of a textual language called Pig Latin
  • 90. Pig Latin • A scripting language to explore large datasets • Easy to achieve parallel execution of simple data analysis tasks • Complex tasks with multiple interrelated data transformations explicitly encoded • Automatic optimization • Create own functions for special-purpose processing • A script can map to multiple map-reduce jobs
  • 92. Benefits • Faster development – Fewer lines of code (Writing map reduce is like writing SQL queries) – Re-use the code (Pig library, Piggy bank) • Conduct a test: Find the top 5 words with most high frequency – Pig Latin needed 10 lines of code as against 200 lines in Java – Pig execution time was 15 minutes as against 4 hours in Java
  • 94. Language Features • A Pig Latin program is made up of a series of transformations applied to the input data to produce output • A declarative, SQL-like language, the high level language interface for Hadoop • Pig Engine Parses, compiles Pig Latin scripts into MapReduce jobs run on top of Hadoop • Keywords - Load, Filter, For each Generate, Group By, Store, Join, Distinct, Order By • Aggregations - Count, Avg, Sum, Max, Min • Schema - Defined at query-time not when files are loaded • UDFs • Packages for common input/output formats
  • 95. Language Features • Virtually all parts of the processing path are customizable: loading, storing, filtering, grouping, and joining can all be altered by UDFs • Writing load and store functions is easy once an InputFormat and OutputFormat exist • Multi-query: pig combines certain types of operations together in a single pipeline to reduce the number of times data is scanned. • Order by provides total ordering across reducers in a balanced way • Piggybank is a repository of UDF Pig functions shared by the Pig community
  • 96. Data Types • Scalar Types - Int, long, float, double, Boolean, null, chararray, bytearray • Complex Types – Field - a piece of data – Tuple - an ordered set of fields – Bag - a collection of tuples – Relation - a bag
  • 97. Data Types • Samples – Tuple is a row in database - ( 0002576169, Tome, 20, 4.0) • Bag – a table or a view in database – an unordered collection of tuples represented using curly braces • {(0002576169 , Tome, 20, 4.0), • (0002576170, Mike, 20, 3.6), • (0002576171 Lucy, 19, 4.0), …. }
  • 98. Running a Pig Latin Script • Local mode – Local host and local file system is used – Neither Hadoop nor HDFS is required – Useful for prototyping and debugging – Suitable only for small datasets • MapReduce mode – Run on a Hadoop cluster and HDFS
  • 99. Running a Pig Latin Script • Batch mode - run a script directly – pig -x local my_pig_script.pig – pig -x mapreduce my_pig_script.pig • Interactive mode uses the Pig shell Grunt to run scripts – grunt> Lines = LOAD '/input/input.txt' AS (line: chararray); – grunt> Unique = DISTINCT Lines; – grunt> DUMP Unique;
  • 100. Running a Pig Latin Script
  • 101. Operations • Loading data – LOAD loads input data – Lines=LOAD ‘input/access.log’ AS (line: chararray); • Projection – FOREACH … GENERATE … (similar to SELECT) – takes a set of expressions and applies them to every record
  • 102. Operations • Grouping – collects together records with the same key • Dump/Store – DUMP displays results to screen - The trigger for Pig to start execution – STORE save results to file system • Aggregation – AVG, COUNT, MAX, MIN, SUM
  • 103. Foreach ... Generate • Iterates over the members of a bag • Example – student_data = FOREACH students GENERATE studentid, name • The result of statement is another bag • Elements are named as in the input bag
  • 104. Positional Reference • Fields are referred to by positional notation or by name (alias) – students = LOAD 'student.txt' USING PigStorage() AS (name:chararray, age:int, gpa:float); – DUMP students; – (John,18,4.0F) – (Mary,19,3.8F) – (Bill,20,3.9F) – studentname = FOREACH students GENERATE $0 AS studentname;
  • 106. Group • Groups data in one relation • GROUP and COGROUP operators are identical but COGROUP creates a nested set of output tuples • Both operators work with one or more relations • For readability GROUP is used in statements involving one relation • COGROUP is used in statements involving two or more relations
  • 107. Group grunt> DUMP A; (John, Pasta) (Kate, Burger) (Joe, Orange) (Eve, Apple) Let’s group by the number of characters in the second field: grunt> B = GROUP A BY SIZE($1); grunt> DUMP B; (5,{(John, Pasta),(Eve, Apple)}) (6,{(Kate, Burger),(Joe, Orange)})
  • 108. Dump & Store A = LOAD ‘input/pig/multiquery/A’; B = FILTER A by $1 == “apple”; C = FILTER A by $1 == “apple”; SOTRE B INTO “output/b” STORE C INTO “output/c” Relations B&C both derived from A Prior this would create two MapReduce jobs Pig will now create one MapReduce job with output results
  • 109. Count • Computes the number of elements in a bag. • Requires a preceding GROUP ALL statement for global counts and GROUP BY statement for group counts. • X = FOREACH B GENERATE COUNT(A);
  • 110. Pig Operation - Order • Sorts a relation based on one or more fields • In Pig, relations are unordered • If you order relation A to produce relation X, relations A and X still contain the same elements • student = ORDER students BY gpa DESC;
  • 111. Example 1 raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, id, time, query); clean1 = FILTER raw BY id > 20 AND id < 100; clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.sanitze(query) as query; user_groups = GROUP clean2 BY (user, query); user_query_counts = FOREACH user_groups GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time); STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(','); – annotations: read from HDFS; input format is tab delimited; run-time schema; row filtering on predicates; group records; group aggregation; store output in a comma-delimited text file
  • 112. Example 2 A = load '$widerow' using PigStorage('\u0001') as (name: chararray, c0: int, c1: int, c2: int); B = group A by name parallel 10; C = foreach B generate group, SUM(A.c0) as c0, SUM(A.c1) as c1, AVG(A.c2) as c2; D = filter C by c0 > 100 and c1 > 100 and c2 > 100; store D into '$out'; – annotations: $widerow and $out are script arguments; input is Ctrl-A delimited; column types are defined in the schema; PARALLEL 10 requests 10 reducers
  • 113. Example 3 – Repartition join register pigperf.jar; A = load 'page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, timestamp, estimated_revenue); B = foreach A generate user, (double) estimated_revenue; alpha = load 'users' using PigStorage('\u0001') as (name, phone, address, city, state, zip); beta = foreach alpha generate name, city; C = join beta by name, B by user parallel 40; D = group C by $0; E = foreach D generate group, SUM(C.estimated_revenue); store E into 'L3out'; – annotations: register UDFs and custom input formats; Ctrl-A delimited input; join two datasets using 40 reducers; load the second file; group after the join; refer to columns by position
  • 114. Example 3 – Replicated Join register pigperf.jar; A = load 'page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, timestamp, estimated_revenue); Big = foreach A generate user, (double) estimated_revenue; alpha = load 'users' using PigStorage('\u0001') as (name, phone, address, city, state, zip); small = foreach alpha generate name, city; C = join Big by user, small by name using 'replicated'; store C into 'out'; – annotations: replicated join, the small dataset is listed second; an optimization for joining a big dataset with a small one
  • 115. Example 5: Multiple Outputs Split records into sets Dump Command to display data Store multiple output A = LOAD 'data' AS (f1:int,f2:int,f3:int); DUMP A; (1,2,3) (4,5,6) (7,8,9) SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6); DUMP X; (1,2,3) (4,5,6) DUMP Y; (4,5,6) STORE x INTO 'x_out'; STORE y INTO 'y_out'; STORE z INTO 'z_out';
  • 116. Parallel Independent Jobs D1 = load 'data1' … D2 = load 'data2' … D3 = load 'data3' … C1 = join D1 by a, D2 by b C2 = join D1 by c, D3 by d C1 and C2 are two independent jobs that can run in parallel
  • 118. Logic Plan A=LOAD 'file1' AS (x, y, z); B=LOAD 'file2' AS (t, u, v); C=FILTER A by y > 0; D=JOIN C BY x, B BY u; E=GROUP D BY z; F=FOREACH E GENERATE group, COUNT(D); STORE F INTO 'output'; Load Load Filter Join Group For each Store
  • 119. Physical Plan • 1:1 correspondence with the logical plan • Except for - Join, Distinct, (Co)Group, Order • Several optimizations are automatic
  • 120. Pig Handling • Schema and type checking • Translating into efficient physical dataflow – sequence of one or more MapReduce jobs • Exploiting data reduction opportunities – early partial aggregation via a combiner • Executing the system-level dataflow – running the MapReduce jobs • Tracking progress, errors etc
  • 121. Example Problem • Given user data in one file, and website data in another, find the top 5 most visited pages by users aged 18-25 Load Users Load Pages Filter by age Join on name Group on url Count clicks Order by clicks Take top 5
  • 122. In Pig Latin • Users = load ‘users’ as (name, age); Filtered = filter Users by age >= 18 and age <= 25; Pages = load ‘pages’ as (user, url); Joined = join Filtered by name, Pages by user; Grouped = group Joined by url; Summed = foreach Grouped generate group, count(Joined) as clicks; Sorted = order Summed by clicks desc; Top5 = limit Sorted 5; store Top5 into ‘top5sites’;
  • 123. Users = load … Filtered = filter … Pages = load … Joined = join … Grouped = group … Summed = … count()… Sorted = order … Top5 = limit … Load Users Load Pages Filter by age Join on name Group on url Count clicks Order by clicks Take top 5 Job 1 Job 2 Job 3 Translation to MapReduce
  • 125. Apache Flume • A distributed, reliable and available service for efficiently collecting, aggregating, and moving large amounts of log data • One-stop solution for data collection of all formats • A simple and flexible architecture based on streaming data flows • A robust and fault tolerant architecture with tuneable reliability mechanisms and many failover and recovery mechanisms
  • 126. Apache Flume • Uses a simple extensible data model that allows for online analytic application • Complex flows – Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination – It also allows fan-in and fan-out flows, contextual routing and backup routes (fail-over) for failed hops
  • 128. Parallelism • When running in MapReduce mode it’s important that the degree of parallelism matches the size of the dataset • By default, Pig uses one reducer per 1GB of input, up to a maximum of 999 • User can override these parameters by setting pig.exec.reducers.bytes.per.reducer (the default is 1000000000 bytes) and pig.exec.reducers.max (default 999)
  • 129. Parallelism • To explicitly set the number of reducers for each job, use a PARALLEL clause for operators that run in the reduce phase • These include all the grouping and joining operators GROUP, COGROUP, JOIN, CROSS as well as DISTINCT and ORDER • Following line sets the number of reducers to 30 for the GROUP – grouped_records = GROUP records BY year PARALLEL 30; • Alternatively, set the default_parallel option for all subsequent jobs – grunt> set default_parallel 30
  • 130. High Level Overview • Local Files • HDFS • Stdin, Stdout • Twitter • IRC • IMAP HDFS Agent
  • 131. Data Flow Model • A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes • A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop) • A Flume source consumes events delivered to it by an external source like a web server • The external source sends events to Flume in a format that is recognized by the target Flume source
  • 132. Data Flow Model • For example, an Avro Flume source can be used to receive Avro events from Avro clients or other Flume agents in the flow that send events from an Avro sink • A similar flow can be defined using a Thrift Flume Source to receive events from a Thrift Sink or a Flume Thrift RPC Client or Thrift clients written in any language generated from the Flume thrift protocol • When a Flume source receives an event, it stores it into one or more channels
  • 133. Data Flow Model • The channel is a passive store that keeps the event until it’s consumed by a Flume sink • The file channel is one example – it is backed by the local file system • The sink removes the event from the channel and puts it into an external repository like HDFS (via Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow • The source and sink within the given agent run asynchronously with the events staged in the channel
  • 134. HDFS Sink • This sink writes events into the Hadoop Distributed File System (HDFS) • Supports creating text and sequence files along with compression • The files can be rolled (close current file and create a new one) periodically based on the elapsed time or size of data or number of events • Buckets/partitions data by attributes like timestamp or machine where the event originated
  • 135. HDFS Sink • The HDFS directory path may contain formatting escape sequences that will be replaced by the HDFS sink to generate a directory/file name to store the events • A Hadoop installation is required so that Flume can use the Hadoop jars to communicate with the HDFS cluster • A version of Hadoop that supports the sync() call is required.
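A minimal sketch of a single-agent Flume configuration ending in an HDFS sink, in Flume's standard properties format; the agent name a1, the spooled directory and the HDFS path are hypothetical examples:

# agent a1: spooling-directory source -> durable file channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/incoming
a1.sources.r1.channels = c1

a1.channels.c1.type = file

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0

The agent would then be started with something like: flume-ng agent --conf conf --conf-file hdfs-sink.properties --name a1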
  • 136. Reliability & Recoverability • The events are staged in a channel on each agent • The events are then delivered to the next agent or terminal repository (like HDFS) in the flow • The events are removed from a channel only after they are stored in the channel of next agent or in the terminal repository • This is a how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow
  • 137. Reliability & Recoverability • Flume uses a transactional approach to guarantee the reliable delivery of the events • The sources and sinks encapsulate in a transaction the storage/retrieval, respectively, of the events placed in or provided by a transaction provided by the channel • This ensures that the set of events are reliably passed from point to point in the flow • In the case of a multi-hop flow, the sink from the previous hop and the source from the next hop both have their transactions running to ensure that the data is safely stored in the channel of the next hop.
  • 138. Reliability & Recoverability • The events are staged in the channel, which manages recovery from failure • Flume supports a durable file channel which is backed by the local file system • There is also a memory channel which simply stores the events in an in-memory queue; this is faster, but any events still left in the memory channel when an agent process dies cannot be recovered
  • 139. Multi-Agent flow • For data to flow across multiple agents or hops, the sink of the previous agent and source of the current hop need to be Avro type with the sink pointing to the hostname (or IP address) and port of the source
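A sketch of the Avro sink/source pairing between two hops; the agent names, channel names, hostname and port below are assumptions made for illustration.

    # First-tier agent: its Avro sink points at the collector's host and port
    agent1.sinks.avroSink.type = avro
    agent1.sinks.avroSink.hostname = collector.example.com
    agent1.sinks.avroSink.port = 4545
    agent1.sinks.avroSink.channel = ch1

    # Second-tier agent: its Avro source listens on the same port
    agent2.sources.avroSrc.type = avro
    agent2.sources.avroSrc.bind = 0.0.0.0
    agent2.sources.avroSrc.port = 4545
    agent2.sources.avroSrc.channels = ch2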
  • 140. Consolidation • A very common scenario in log collection is a large number of log-producing clients sending data to a few consumer agents that are attached to the storage subsystem • For example, logs collected from hundreds of web servers sent to a dozen agents that write to the HDFS cluster • This can be achieved in Flume by configuring a number of first-tier agents with an Avro sink, all pointing to an Avro source of a single agent (Thrift sources / sinks / clients can be used in such a scenario as well) • This source on the second-tier agent consolidates the received events into a single channel, which is consumed by a sink to its final destination
  • 142. Multiplexing Flow • Flume supports multiplexing the event flow to one or more destinations • This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels • For the multiplexing case, an event is delivered to a subset of available channels when an event's attribute matches a preconfigured value • For example, if an event attribute called "txnType" is set to "customer", the event should go to channel1 and channel3; if it is "vendor", it should go to channel2; otherwise it goes to channel3 • The mapping can be set in the agent's configuration file
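A sketch of how the txnType routing described above could be expressed with a multiplexing channel selector; the agent, source and channel names are assumptions.

    a1.sources.r1.channels = channel1 channel2 channel3
    a1.sources.r1.selector.type = multiplexing
    a1.sources.r1.selector.header = txnType
    # Events with txnType=customer go to channel1 and channel3
    a1.sources.r1.selector.mapping.customer = channel1 channel3
    # Events with txnType=vendor go to channel2
    a1.sources.r1.selector.mapping.vendor = channel2
    # Everything else falls through to channel3
    a1.sources.r1.selector.default = channel3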
  • 145. Apache Sqoop • An open-source tool to extract data from a relational database into HDFS or HBase • Connectors are available for MySQL, PostgreSQL, Oracle, SQL Server and DB2 • A single client program that creates one or more MapReduce jobs to perform its tasks • By default 4 map tasks are used in parallel • Sqoop does not have any server processes • If we assume a table with 1 million records and four mappers, then each will process 250,000 records
  • 146. Apache Sqoop • With its knowledge of the primary key column, Sqoop can create four SQL statements to retrieve the data, each using a different range of the primary key column in its WHERE clause • In the simplest case, this could be as straightforward as adding something like WHERE id BETWEEN 1 AND 250000 to the first statement and using different id ranges for the others • In addition to writing the contents of the database table to HDFS, Sqoop also provides a generated Java source file (widgets.java) written to the current local directory • Sqoop uses this generated code to handle the deserialization of table-specific data from the database source before writing it to HDFS
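As a sketch, an import of the kind described above (the widgets table matching the generated widgets.java) might be launched as follows; the JDBC URL, credentials and target directory are hypothetical, and --split-by names the primary key column whose range is divided across the four mappers.

    $ sqoop import --connect jdbc:mysql://dbhost/sales --username dbuser -P \
          --table widgets --split-by id -m 4 \
          --target-dir /data/widgets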
  • 149. Commands • codegen - Generate code to interact with database records • create-hive-table - Import a table definition into Hive • eval - Evaluate a SQL statement and display the results • export - Export an HDFS directory to a database table • help - List available commands • import - Import a table from a database to HDFS • import-all-tables - Import tables from a database to HDFS • job - Work with saved jobs • list-databases - List available databases on a server • list-tables - List available tables in a database • merge - Merge results of incremental imports • metastore - Run a standalone Sqoop metastore • version - Display version information
  • 150. Importing data into Hive using Sqoop • Sqoop has significant integration with Hive, allowing it to import data from a relational source into either new or existing Hive tables
$ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest --username hadoopuser -P \
      --table employees --hive-import --hive-table employees
  • 151. Export • An export uses HDFS as the source of data and a remote database as the destination
% sqoop export --connect jdbc:mysql://localhost/hadoopguide -m 1 \
> --table sales_by_zip --export-dir /user/hive/warehouse/zip_profits \
> --input-fields-terminated-by '\0001'
...
10/07/02 16:16:50 INFO mapreduce.ExportJobBase: Transferred 41 bytes in 10.8947 seconds (3.7633 bytes/sec)
10/07/02 16:16:50 INFO mapreduce.ExportJobBase: Exported 3 records.
  • 152. Export
  • 154. Apache ZooKeeper • A set of tools to build distributed applications that can safely handle partial failures • A rich set of building blocks to build a large class of coordination data structures and protocols like distributed queues, distributed locks, and leader election among a group of peers • Runs on a collection of machines for high availability • Avoids single points of failure for reliability • Facilitates loosely coupled interactions so that participants do not need to know about one another • An open source, shared repository of implementations and recipes of common coordination patterns • Provides built-in services such as naming, configuration management, locks and synchronization, and group services, offering high-performance coordination for distributed applications
  • 155. Apache Zookeeper • Written in Java • Strongly consistent • Ensemble of Servers • In-memory data • Datasets must fit in memory • Shared hierarchical namespace • Access Control list for each node • Similar to a file system
  • 157. ZooKeeper Service • All servers store a copy of the data in memory • A leader is elected at startup • Followers respond to client requests • All updates go through the leader • Responses are sent once a majority of servers have persisted the change
  • 160. Znodes • A unified concept of a node called a znode • Acts both as a container of data (like a file) and a container of other znodes (like a directory) • Form a hierarchical namespace • Two types - ephemeral or persistent. Set at creation time and not changed later • To build a membership list, create a parent znode with the name of the group and child znodes with the name of the group members (servers)
  • 161. Znodes • Referenced by paths, which are represented as slash-delimited Unicode character strings, like file system paths in Unix • Paths must be absolute, so they must begin with a slash character • Paths are canonical hence each path has a single representation
  • 162. API • create - Creates a znode • delete - Deletes a znode (must not have any children) • exists - Tests if a znode exists and retrieves its metadata • getACL, setACL - Gets/sets the ACL for a znode • getChildren - Gets a list of the children of a znode • getData, setData - Gets/sets data associated with a znode • sync - Synchronizes a client’s view of a znode with ZooKeeper
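A minimal Java sketch exercising these operations; the ensemble address zk1:2181,zk2:2181,zk3:2181 and the /config znode are assumptions made for illustration, not something defined in the slides.

import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class ZkApiSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a hypothetical ensemble; the lambda is the default watcher
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {});

        // create - a persistent znode holding a small payload
        zk.create("/config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists / getData - read the znode and its metadata back
        Stat stat = zk.exists("/config", false);
        byte[] data = zk.getData("/config", false, stat);
        System.out.println("Data: " + new String(data) + ", version " + stat.getVersion());

        // setData - version -1 means "update regardless of current version"
        zk.setData("/config", "v2".getBytes(), -1);

        // getChildren - list the children of the root znode
        System.out.println("Children of /: " + zk.getChildren("/", false));

        // delete - remove the znode (it must have no children)
        zk.delete("/config", -1);
        zk.close();
    }
}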
  • 163. Ephemeral Nodes • Deleted when the creating client’s session ends • May not have children, not even ephemeral ones. • Even though tied to a client session, they are visible to all clients subject to their ACL policy • Ideal for building applications that need to know when certain distributed resources are available
  • 164. Ephemeral Nodes • Example - a group membership service that allows any process to discover the members of the group at any particular time • A persistent znode is not tied to the client’s session and is deleted only when explicitly deleted by any client
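A sketch of the group-membership pattern using ephemeral znodes; the ensemble address, the /servers group path and the member name passed as a command-line argument are all assumptions.

import org.apache.zookeeper.*;

public class GroupMember {
    public static void main(String[] args) throws Exception {
        // Hypothetical ensemble and group path; the member name is passed as args[0]
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, e -> {});

        // The group parent is a persistent znode; create it once if missing
        if (zk.exists("/servers", false) == null) {
            zk.create("/servers", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Each member registers as an ephemeral child; the znode is removed
        // automatically when this client's session ends
        zk.create("/servers/" + args[0], new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Any client can discover the current membership
        System.out.println("Members: " + zk.getChildren("/servers", false));

        // Keep the process (and hence the session and the membership) alive
        Thread.sleep(Long.MAX_VALUE);
    }
}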
  • 165. Sequence Nodes • A sequential znode has a sequence number • A znode created with the sequential flag set has the value of a monotonically increasing 10-digit counter, maintained by the parent znode, appended to its name • If a client asks to create a sequential znode with the name /a/b-, for example, then the znode created may have the name /a/b-3 (zero-padded to 10 digits in practice, i.e. /a/b-0000000003)
  • 166. Sequence Nodes • Another new sequential znode with the name /a/b will have a unique name with a larger value of the counter - for example, /a/b-5 • Sequence numbers can be used to impose a global ordering on events in a distributed system, and may be used by the client to infer the ordering
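A small Java sketch of sequential naming, assuming a hypothetical ensemble address and that the parent znode /a already exists.

import org.apache.zookeeper.*;

public class SequentialDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, e -> {});
        // The server appends a zero-padded, monotonically increasing counter,
        // so repeated calls yield names like /a/b-0000000003, /a/b-0000000005, ...
        String created = zk.create("/a/b-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("Created " + created);
        zk.close();
    }
}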
  • 167. Watches • Allow clients to get notifications when a znode changes (data or children) • Work like a one-shot callback mechanism triggered when connection or znode state changes • A watch set on an exists operation will be triggered when the znode being watched is created, deleted, or has its data updated • A watch set on a getData operation will be triggered when the znode being watched is deleted or has its data updated
  • 168. Watches • A watch set on a getChildren operation will be triggered when a child of the znode being watched is created or deleted, or when the znode itself is deleted • Triggered only once • To receive multiple notifications, a client needs to reregister the watch • If a client wishes to receive further notifications for the znode’s existence (to be notified when it is deleted, for example), it needs to call the exists operation again to set a new watch
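A Java sketch of the re-registration pattern for an exists watch; the ensemble address and the /config path are assumptions for illustration.

import org.apache.zookeeper.*;

public class ExistsWatch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, e -> {});
        watchNode(zk, "/config");              // hypothetical znode to watch
        Thread.sleep(Long.MAX_VALUE);          // keep the session open
    }

    // A watch fires only once, so the callback sets a new one by calling exists() again
    static void watchNode(ZooKeeper zk, String path) throws Exception {
        zk.exists(path, event -> {
            System.out.println(path + " changed: " + event.getType());
            try {
                watchNode(zk, path);           // re-register for further notifications
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        });
    }
}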
  • 169. High Availability Mechanism • For resilience, ZooKeeper runs in replicated mode on a cluster of machines called an ensemble • Achieves high availability through replication and can provide a service as long as a majority of the machines in the ensemble are up • ZooKeeper uses a protocol called Zab that runs in two phases and is repeated indefinitely
  • 170. High Availability Mechanism • Phase 1: Leader election – The machines in an ensemble go through a process of electing a distinguished member, called the leader – The other machines are termed followers. – This phase is finished once a majority (or quorum) of followers have synchronized their state with the leader
  • 171. High Availability Mechanism • Phase 2: Atomic broadcast – All write requests are forwarded to the leader, which broadcasts the update to the followers – When a majority have persisted the change, the leader commits the update, and the client gets a response saying the update succeeded – The protocol for achieving consensus is designed to be atomic, so a change either succeeds or fails. It resembles a two-phase commit
  • 172. References
1. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI 2004 (Google)
2. D. Cutting and E. Baldeschwieler, "Meet Hadoop," OSCON, Portland, OR, USA, 25 July 2007 (Yahoo!)
3. R. E. Bryant, "Data Intensive Scalable Computing: The Case for DISC," Tech Report CMU-CS-07-128
4. A. Thusoo et al., "Hive: A Warehousing Solution over a Map-Reduce Framework," Proceedings of VLDB '09, 2009
5. http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
6. http://flume.apache.org
7. http://incubator.apache.org/sqoop/
8. Roman, Javi. "The Hadoop Ecosystem Table". github.com
9. Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. John Wiley & Sons
10. "Refactor the scheduler out of the JobTracker". Hadoop Common. Apache Software Foundation
11. Jones, M. Tim (6 December 2011). "Scheduling in Hadoop". ibm.com. IBM
12. "Hadoop and Distributed Computing at Yahoo!". Yahoo!
13. "HDFS: Facebook has the world's largest Hadoop cluster!". Hadoopblog.blogspot.com
14. "Under the Hood: Hadoop Distributed File System Reliability with Namenode and Avatarnode". Facebook
  • 173. Thank You Check Out My LinkedIn Profile at https://in.linkedin.com/in/girishkhanzode
