Apache Hadoop is an open-source Java software framework for running data-intensive applications on large clusters of commodity hardware. Hadoop has two main components:
Hadoop Distributed Filesystem: HDFS stores data
on nodes across the cluster with the goal of providing high
aggregate bandwidth across the cluster.
Hadoop MapReduce: a computational paradigm
called Map/Reduce, which divides an application
into multiple fragments of work, each of which can be executed on any node in the cluster.
Hadoop Distributed Filesystem
HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to this information.
Based on Google’s GFS
Redundant storage of massive amounts of data on cheap and unreliable computers
Why not use an existing file system?
– Different workload and design priorities
– Handles much bigger dataset sizes than other file systems
High component failure rates
– Inexpensive commodity components fail all the time
“Modest” number of HUGE files
– Just a few million
– Each is 100MB or larger; multi-GB files typical
Files are write-once, mostly appended to
– Perhaps concurrently
Large streaming reads
Files stored as blocks
– Much larger size than most filesystems (default is 64MB)
Reliability through replication
– Each block replicated across 3+ DataNodes
Single master (NameNode) coordinates access, metadata
– Simple centralized management
No data caching
– Little benefit due to large data sets, streaming
Familiar interface, but customize the API
– Simplify the problem; focus on distributed apps
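As a back-of-the-envelope sketch of the block design above. The class and method names below are invented for illustration; the 64 MB block size and 3x replication factor are the defaults mentioned in these notes.

```java
// Sketch: how a file is split into fixed-size blocks, HDFS-style.
// BLOCK_SIZE mirrors the 64 MB default; this is not Hadoop API code.
public class BlockSplit {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default

    // Number of blocks needed for a file of the given length in bytes.
    static long numBlocks(long fileLength) {
        return (fileLength + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    // Total raw storage consumed once each block is replicated.
    static long replicatedBytes(long fileLength, int replication) {
        return fileLength * replication;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println(numBlocks(oneGb));          // 16 blocks
        System.out.println(replicatedBytes(oneGb, 3)); // 3221225472 (3 GB raw)
    }
}
```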
Hdfs NameNode Hdfs DataNode Hdfs DataNode Hdfs aware application Posix API HDFS API Hdfs view Network Stack Regular filesystem Specific drivers.. HDFS Client Block Diagram Client computer
HDFS Master “Namenode”
– Manages all filesystem metadata
Transactions are logged, merged at startup
– Controls read/write access to files
– Manages block replication
HDFS Slaves “Datanodes”
– Notify the NameNode of the block IDs they store
– Serve read/write requests from clients
– Perform replication tasks upon instruction from the NameNode
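The master/slave split above can be modeled in a few lines of plain Java: a toy NameNode that holds only block metadata (which DataNodes store which block IDs), built from the block reports DataNodes send. All names here are illustrative, not part of the Hadoop API.

```java
import java.util.*;

// Toy model of the NameNode's block map: block ID -> set of DataNodes.
// Real HDFS metadata is far richer; this mirrors only the block-report idea.
public class ToyNameNode {
    private final Map<String, Set<String>> blockLocations = new HashMap<>();

    // A DataNode reports the block IDs it currently stores.
    public void blockReport(String dataNode, List<String> blockIds) {
        for (String id : blockIds) {
            blockLocations.computeIfAbsent(id, k -> new HashSet<>()).add(dataNode);
        }
    }

    // Clients ask the NameNode where a block can be read from.
    public Set<String> locate(String blockId) {
        return blockLocations.getOrDefault(blockId, Collections.emptySet());
    }

    // A block on fewer than `target` nodes needs re-replication.
    public boolean underReplicated(String blockId, int target) {
        return locate(blockId).size() < target;
    }
}
```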
HDFS: Handling Failures
The NameNode is a single point of failure
Secondary NameNode provides a consistent checkpoint
– Copies the FsImage and transaction log from the NameNode and merges them
– Uploads the new FsImage back to the NameNode
Uses CRC checks to detect data corruption.
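The CRC idea can be demonstrated with Java's standard java.util.zip.CRC32. The surrounding class is just a sketch; HDFS's actual per-chunk checksum machinery is more involved.

```java
import java.util.zip.CRC32;

// Demonstrates CRC-based corruption detection: a checksum stored at write
// time is recomputed at read time; a mismatch flags the replica as corrupt.
public class CrcCheck {
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some block data".getBytes();
        long stored = checksum(block);   // computed when the block was written

        block[0] ^= 1;                   // simulate a flipped bit on disk
        boolean corrupt = checksum(block) != stored;
        System.out.println(corrupt);     // true: corruption detected
    }
}
```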
Hadoop MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets.
Map/Reduce Programming Model
Borrows from functional programming
Users implement an interface of two functions:
– map (in_key, in_value) -> (out_key, intermediate_value) list
– reduce (out_key, intermediate_value list) -> out_value list
Records from the data source are fed into the map function as key*value pairs.
map() produces one or more intermediate values along with an output key from the input.
map (in_key, in_value) ->(out_key, intermediate_value) list
Example: Upper-Case Mapper
let map(k, v) = emit(k.toUpper(), v.toUpper())
(“foo”, “bar”) => (“FOO”, “BAR”)
(“Foo”, “other”) => (“FOO”, “OTHER”)
(“key2”, “data”) => (“KEY2”, “DATA”)
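The same mapper, written as a plain Java function for illustration; the returned list stands in for Hadoop's output collector, and the class name is invented.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java version of the upper-case mapper: one input pair in, one
// intermediate pair out. In real Hadoop this logic would live in a Mapper
// subclass that writes to a context instead of returning a list.
public class UpperCaseMapper {
    static List<Map.Entry<String, String>> map(String key, String value) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        out.add(new SimpleEntry<>(key.toUpperCase(), value.toUpperCase()));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("foo", "bar")); // [FOO=BAR]
    }
}
```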
After the map phase is over, all the intermediate values for a given output key are combined together into a list
reduce() combines those intermediate values into one or more final values for that same output key
(in practice, usually only one final value per key)
reduce (out_key, intermediate_value list) -> out_value list
Example: Sum Reducer
let reduce(k, vals) =
    sum = 0
    foreach int v in vals:
        sum += v
    emit(k, sum)
MapReduce Dataflow Example: Word Count
Input: “Hi, how are you? I am good” / “Hello Hello how are you? Not so good”
Map output (intermediate results): (hi 1) (how 1) (are 1) (you 1) and (hello 1) (hello 1) (how 1) (are 1) (you 1)
Merged and sorted: are [1 1], hello [1 1], hi [1], how [1 1], you [1 1]
Reduce output: are 2, hello 2, hi 1, how 2, you 2
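The full word-count dataflow can be sketched in plain Java: map, then the framework's merge/sort step (modeled here with a TreeMap), then reduce. This mirrors the behavior, not the Hadoop API; all names are illustrative.

```java
import java.util.*;

// End-to-end word count: map -> shuffle/sort -> reduce, in one process.
public class WordCountFlow {
    // Map phase: one (word, 1) pair per word, lower-cased, punctuation stripped.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return out;
    }

    // Shuffle/sort phase: group intermediate values by key, keys sorted.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the value list for each key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    static SortedMap<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> inter = new ArrayList<>();
        for (String line : lines) inter.addAll(map(line));
        SortedMap<String, Integer> result = new TreeMap<>();
        shuffle(inter).forEach((k, vs) -> result.put(k, reduce(k, vs)));
        return result;
    }
}
```

In real Hadoop, the map and reduce calls run in parallel on different nodes and the framework performs the shuffle over the network; the logic per key is the same.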
map() functions run in parallel, creating different intermediate values from different input data sets.
reduce() functions also run in parallel, each working on a different output key.
All values are processed independently
Bottleneck: reduce phase can’t start until map phase is completely finished.
Combiners
– Run on mapper nodes after the map phase
– A “mini-reduce,” run only on one mapper's local output
– Used to save bandwidth before sending data to the full reducer
– The reducer can serve as the combiner if its operation is commutative and associative
– e.g., SumReducer
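A sketch of a combiner doing this local mini-reduce on one mapper's word-count output. Here the combine step performs the same sum as the reducer, which is valid because addition is commutative and associative; all names are illustrative.

```java
import java.util.*;

// Local "mini-reduce" on one mapper's output: collapse repeated keys
// into partial sums before anything is sent over the network.
public class Combiner {
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> partial = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> p : mapOutput) {
            partial.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return new ArrayList<>(partial.entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>(List.of(
            new AbstractMap.SimpleEntry<>("hello", 1),
            new AbstractMap.SimpleEntry<>("hello", 1),
            new AbstractMap.SimpleEntry<>("how", 1)));
        // 3 pairs in, 2 pairs out: less data shipped to the reducer.
        System.out.println(combine(out)); // [hello=2, how=1]
    }
}
```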
Hadoop Map-Reduce Architecture
Map-Reduce Master “Jobtracker”
– Accepts MR jobs submitted by users
– Assigns Map and Reduce tasks to Tasktrackers
– Monitors task and Tasktracker status, re-executing tasks upon failure
Map-Reduce Slaves “Tasktrackers”
– Run Map and Reduce tasks upon instruction from the Jobtracker
– Manage storage and transmission of intermediate output
Writing a Hadoop MapReduce program means defining a Mapper class, a Reducer class, and a “launching” (driver) program that configures and submits the job.