Hadoop Distributed Computing Framework for Big Data
The Motivation for Hadoop
• Hadoop is an open-source distributed computing framework
for processing large-scale data sets.
• Created by Doug Cutting; originated in Apache Nutch and was
split out of Nutch in 2006
• Based on Google's GFS paper (2003) and MapReduce paper
(Jeff Dean, 2004). Google: 200 clusters, each with 1000+ nodes
• Yahoo: 42000 nodes, LinkedIn: 4100 nodes, Facebook:
1400, eBay: 500, TaoBao: 2000 (biggest in CN)
• Ecosystem: HBase, Hive, Pig, Zookeeper, Oozie, Mahout...
• Problems in traditional big data processing (MPI, Grid
Computing, Volunteer Computing):
✴It’s difficult to deal with partial failures of the system.
✴Bandwidth is finite and precious: combining data from
different disks requires network transfer, which is
very slow for large data volumes.
✴Data exchange requires synchronization.
✴Temporal dependencies are complicated.
How Hadoop Saves Big Data
• Hadoop provides partial failure support. The Hadoop Distributed File System
(HDFS) can store large data sets with high reliability and scalability.
• HDFS provides great fault tolerance. A partial failure will not result in the
failure of the entire system, and HDFS provides data recoverability for partial
failures.
• Hadoop introduces MapReduce, which spares programmers from low-level
details such as partial failure. The MapReduce framework detects failed tasks
and reschedules them automatically.
• Hadoop provides data locality. The MapReduce framework tries to collocate
data with the compute nodes. Data is local, and tasks are separate, with no
dependence on each other. This shared-nothing, data-local architecture
saves bandwidth and avoids complicated dependencies.
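The automatic rescheduling of failed tasks described above can be illustrated with a minimal Python sketch. This is not Hadoop's scheduler: `run_with_retries`, `make_flaky_task`, and the limit of 4 attempts are hypothetical names and values chosen for illustration only.

```python
def run_with_retries(task, max_attempts=4):
    """Run task() until it succeeds or max_attempts is exhausted.

    Returns (result, attempts_used). max_attempts=4 is an illustrative value.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception as err:  # a failed attempt is simply rescheduled
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def make_flaky_task(failures_before_success):
    """Build a hypothetical task that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise IOError("simulated node failure")
        return "done"

    return task


result, attempts = run_with_retries(make_flaky_task(2))
print(result, attempts)  # done 3
```

The caller never sees the two simulated failures; it only receives the final result, which is the kind of low-level detail the framework hides from the programmer.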
Hadoop Basic Concepts
• The core concept of Hadoop is to distribute the
data as it is initially stored in the system. That is
• Applications are written in high-level code.
• Nodes depend on each other as little as possible.
• Data replication: data is spread among machines in
Hadoop High-Level Overview
• HDFS (Hadoop Distributed File System) is
a distributed file system designed to store large data
sets and streaming data sets on commodity
hardware with high scalability, reliability and
• MapReduce is a parallel programming model and
an associated implementation for processing and
generating large data sets. It provides a clean
abstraction for programmers.
• NameNode: HDFS namespace and
• Secondary NameNode, which performs
housekeeping functions for the NameNode; it
isn’t a backup or hot standby for the
NameNode.
• DataNode, which stores actual HDFS data
blocks. In Hadoop, a large file is split into
64 MB or 128 MB blocks.
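The block-splitting arithmetic can be sketched in a few lines of Python. This only models the offsets and lengths, not real HDFS I/O; `split_into_blocks` is a hypothetical helper, and the 200 MB file size is an arbitrary example.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


blocks = split_into_blocks(200 * MB)
print(len(blocks))          # 4
print(blocks[-1][1] // MB)  # 8
```

A 200 MB file with 64 MB blocks yields three full blocks plus one 8 MB block; the final block occupies only as much space as it needs.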
• JobTracker, which manages MapReduce
jobs and distributes individual tasks to machines
• TaskTracker, which initiates and monitors
each individual Map and Reduce task.
Each Daemon Runs in Its Own JVM
POSIX: Portable Operating System Interface
• bin/hadoop fs -copyFromLocal conf input
• bin/hadoop jar hadoop-examples-1.2.1.jar grep input
• bin/hadoop fs -cat output/*
• localhost:50030, check MapReduce status
• localhost:50070, check HDFS status
HDFS: Basic Concepts
• Highly fault-tolerant: handles partial failures
• Streaming Data Access: Block Data(64 MB,
• Large data sets: GB, TB, PB
• DataNode: stores
actual data blocks
• NameNode Data Persistence: FSImage and EditLog
✤ FSImage persists the filesystem tree, the mapping
of files to blocks, and filesystem properties
✤ Physical block locations are not persisted; they
are kept in RAM
• Checkpoint: Merge EditLog with FSImage
• Secondary NameNode Housekeeping: Periodically
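The checkpoint idea (merging the EditLog into the FSImage) can be illustrated with a toy model. This is a sketch under simplifying assumptions: the image is a plain dict from path to block IDs, and the operation names `create`, `delete`, and `rename` are hypothetical stand-ins, not the real on-disk formats.

```python
def apply_edit(image, edit):
    """Apply one logged operation to the in-memory image (toy model)."""
    op, path = edit[0], edit[1]
    if op == "create":
        image[path] = list(edit[2])       # edit[2] is the list of block ids
    elif op == "delete":
        image.pop(path, None)
    elif op == "rename":
        image[edit[2]] = image.pop(path)  # edit[2] is the new path
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, editlog):
    """Replay the edit log on a copy of the image; return (new_image, empty_log)."""
    new_image = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in editlog:
        apply_edit(new_image, edit)
    return new_image, []


fsimage = {"/a.txt": ["blk_1"]}
editlog = [
    ("create", "/b.txt", ["blk_2", "blk_3"]),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt"),
]
new_image, new_log = checkpoint(fsimage, editlog)
print(new_image)  # {'/c.txt': ['blk_1']}
```

After a checkpoint the log is empty, so a restart only has to load the merged image rather than replay a long history of edits.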
HDFS: Data Replica
• 3 replicas: high reliability
• the first replica on a node
in the local rack
• the second one on a
node in a different
rack
• the third one on a
different node in the
same remote rack.
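The placement rule above can be sketched as follows. This is an illustrative model, not HDFS's placement code: `place_replicas` and `rack_of` are hypothetical helpers, the topology is a made-up dict of rack name to node names, and the sketch assumes the first replica goes on the writer's own node.

```python
import random


def rack_of(topology, node):
    """Return the rack that contains the given node."""
    return next(rack for rack, nodes in topology.items() if node in nodes)


def place_replicas(topology, writer_node, rng=random):
    """Pick three nodes: one in the local rack, two distinct nodes in one remote rack."""
    local_rack = rack_of(topology, writer_node)
    # Only remote racks with at least two nodes can hold replicas two and three.
    remote_racks = [r for r, nodes in topology.items()
                    if r != local_rack and len(nodes) >= 2]
    if not remote_racks:
        raise ValueError("need a remote rack with at least two nodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer_node, second, third]


topology = {
    "rack1": ["n1", "n2"],
    "rack2": ["n3", "n4", "n5"],
    "rack3": ["n6", "n7"],
}
replicas = place_replicas(topology, "n1", rng=random.Random(0))
print(replicas)
```

Keeping two replicas in one remote rack means a write crosses racks only once, while the data still survives the loss of an entire rack.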
SPOF: HDFS Federation
• Scale NameNode
• Each NameNode has
• DataNode: Stores
blocks from different
SPOF: Single Point of Failure
SPOF: HDFS High Availability (HA)
• An ad-hoc standby
• Active NN writes updates
to shared NFS
• Standby NN pulls and
merges logs, up-to-date in
• DataNodes: send block
reports to both NNs
• Failover in tens of seconds
• A map task processes a key/value pair to generate a set of
intermediate key/value pairs.
✴ Input: key is the offset of each line, value is the line itself
✴ Output: <apple, 1>…<pear, 1>, <peach, 1>, written to local disk, not HDFS
• A reduce task merges all intermediate values associated with the
same intermediate key
• Shuffle and sort
• Input: the output from map tasks, grouped by key, e.g. <apple, 1> … <apple, 1>
• Output: <apple, 5>, written to HDFS
• No reduce task can start until every map task has finished (Speculative Execution mitigates slow tasks)
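The map, shuffle-and-sort, and reduce steps above can be simulated in a single Python process. This is a sketch of the programming model only, not of Hadoop's Java API; `map_task`, `shuffle_and_sort`, `reduce_task`, and `word_count` are hypothetical names, and the input lines are made up.

```python
from collections import defaultdict


def map_task(offset, line):
    """Map: (line offset, line text) -> list of (word, 1) pairs."""
    return [(word, 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Group intermediate values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_task(key, values):
    """Reduce: merge all values that share the same key."""
    return key, sum(values)


def word_count(lines):
    intermediate = []
    offset = 0
    for line in lines:
        intermediate.extend(map_task(offset, line))
        offset += len(line) + 1  # +1 for the newline separating lines
    # Reducing starts only after every map call above has finished.
    return [reduce_task(key, values) for key, values in shuffle_and_sort(intermediate)]


counts = word_count(["apple pear apple", "peach apple", "apple pear apple"])
print(counts)  # [('apple', 5), ('peach', 1), ('pear', 2)]
```

In a real cluster the map calls run on different nodes and the shuffle moves data over the network, but the data flow is the same.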
Memory is allocated dynamically at fine granularity (1 GB to 10 GB), not in fixed slots
No JVM reuse: each task runs in its own JVM
MapReduce is just one kind of application
The Application Master aggregates job status, not the Resource Manager
When Not to Use Hadoop?
• Low-latency Data Access: real-time needs, HBase
• Structured Data: RDBMS, ad-hoc SQL queries
• When data isn’t that big: Hadoop needs TB and PB, not GB
• Too many small files
• More writes than reads
• MapReduce may not be the best choice: it suits data with no
dependencies that can be processed in parallel.
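The "too many small files" point can be made concrete with a rough estimate of NameNode memory. The figure of about 150 bytes per namespace object (file or block) is a commonly quoted rule of thumb, not a number from these slides, and `namenode_memory_bytes` is a hypothetical helper.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not an exact figure


def namenode_memory_bytes(num_files, blocks_per_file=1):
    """Estimate NameNode heap used by file and block metadata."""
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT


small = namenode_memory_bytes(10_000_000)  # ten million small files, one block each
large = namenode_memory_bytes(8192)        # 8192 files of one 128 MB block each (1 TiB)
print(small)  # 3000000000
print(large)  # 2457600
```

Under these assumptions, ten million small files cost the NameNode roughly 3 GB of memory, while 1 TiB stored as 128 MB files costs a few megabytes.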