The document discusses MapReduce and the Hadoop framework. It provides an overview of how MapReduce works, examples of problems it can solve, and how Hadoop implements MapReduce at scale across large clusters in a fault-tolerant manner using the HDFS distributed file system and YARN resource management.
2. MapReduce is a programming model that Google has used successfully to process its "big-data" sets (~20 petabytes per day)
Users specify the computation in terms of a map and a reduce function.
The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, and
also handles machine failures, efficient communication, and performance issues.
3. Consider a large data collection:
{web, weed, green, sun, moon, land, part, web, green, …}
Problem: count the occurrences of the different words in the collection.
Let's design a solution for this problem:
We will start from scratch
We will add and relax constraints
We will do incremental design, improving the solution for performance and scalability
(A word-count sketch in Hadoop's Java API follows below.)
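To make the map/reduce division concrete, here is a minimal word-count sketch using the standard Hadoop `org.apache.hadoop.mapreduce` API. Class names follow the stock Hadoop WordCount example; treat this as an illustration, not part of the original slides:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, 1) for every word in an input line.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        ctx.write(word, ONE);          // e.g., ("web", 1)
      }
    }
  }
}

// Reduce: sum the counts for each word. Runs only after all maps finish.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    ctx.write(word, new IntWritable(sum));   // e.g., ("web", 2)
  }
}
```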
4. Very large-scale data: peta-, exabytes
Write-once, read-many data: allows for parallelism without mutexes
Map and Reduce are the main operations: simple code
There are other supporting operations such as combine and partition (out of the scope of this talk)
All map tasks must complete before the reduce operation starts (see the driver sketch below)
Map and reduce operations are typically performed by the same physical processor
Number of map tasks and reduce tasks is configurable
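A minimal driver sketch for the word-count job above, showing where these knobs live. `setNumReduceTasks` and the combiner hook are standard `org.apache.hadoop.mapreduce.Job` API; the input and output paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class); // optional "combine" step
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // The reduce-task count is set directly; the map-task count follows
    // from the number of input splits (roughly one per HDFS block).
    job.setNumReduceTasks(4);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Reducers consume map output only after the map phase completes
    // for their partitions (the shuffle barrier).
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```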
5. [Figure: spectrum of parallelism, from small data sizes to large — pipelined (instruction level), concurrent (thread level), service (object level), indexed (file level), mega (block level), virtual (system level)]
6. At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose.
GFS is not open source.
Doug Cutting and Yahoo! reimplemented GFS's design, working from Google's published papers, and called the result the Hadoop Distributed File System (HDFS).
The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop.
It is open source and distributed by Apache.
7. Highly fault-tolerant
High throughput
Suitable for applications with large data sets
Streaming access to file system data (illustrated in the sketch below)
Can be built out of commodity hardware
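To make "streaming access" concrete: a client reads an HDFS file through the same java.io stream abstractions as a local file. A minimal sketch using the public org.apache.hadoop.fs API; the cluster URI and file path are made-up placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS would normally come from core-site.xml; this URI is a placeholder.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    try (FileSystem fs = FileSystem.get(conf);
         FSDataInputStream in = fs.open(new Path("/data/words.txt"));
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);   // data streams block by block from datanodes
      }
    }
  }
}
```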
8. Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
Large datasets: terabytes or petabytes of data
Large clusters: hundreds or thousands of nodes
Hadoop is an open-source implementation of Google's MapReduce
Hadoop is based on a simple programming model called MapReduce
Hadoop is based on a simple data model: any data will fit
9. The Hadoop framework consists of two main layers:
Distributed file system (HDFS)
Execution engine (MapReduce)
10. Automatic parallelization & distribution
Hidden from the end-user
Fault tolerance and automatic recovery
Nodes/tasks will fail and will recover automatically
Clean and simple programming abstraction
Users only provide two functions “map” and
“reduce”
11. Google: invented the MapReduce computing paradigm
Yahoo: developed Hadoop, the open-source implementation of MapReduce
IBM, Microsoft, Oracle
Facebook, Amazon, AOL, Netflix
Many others, plus universities and research labs
12. Master node (single node)
Many slave nodes
Both layers run across the cluster:
• Distributed file system (HDFS)
• Execution engine (MapReduce)
13. Centralized namenode
- Maintains metadata info about files
Many datanodes (1000s)
- Store the actual data
- Files are divided into blocks (64 MB each)
- Each block is replicated N times (default N = 3)
[Figure: file F divided into five 64 MB blocks, numbered 1–5; see the block-location sketch below]
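A quick way to observe blocks and replication from client code is the public FileSystem API, which reports each block's offset, length, and replica hosts. A minimal sketch; the file path is a placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/fileF"));
    // One BlockLocation per block; getHosts() lists the datanodes holding
    // that block's replicas (three hosts under the default replication).
    for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.printf("offset=%d len=%d replicas=%s%n",
          b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
    }
  }
}
```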
14. Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data
Replication: each data block is replicated many times (default is 3)
Failure: failure is the norm rather than the exception
Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
The namenode constantly monitors the datanodes via heartbeats
15. Deciding what will be the key and what will be the value is the developer's responsibility (see the sketch below).
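For instance, the same input lines support different jobs purely through the choice of key/value: word count emits (word, 1), while a hypothetical inverted index emits (word, documentId). A sketch of the latter mapper; the class name and the assumed "docId word word …" line format are illustrative, not from the slides:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumed input line format: "<docId> <word> <word> ..."
public class InvertedIndexMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] tokens = line.toString().split("\\s+");
    if (tokens.length < 2) return;
    Text docId = new Text(tokens[0]);
    for (int i = 1; i < tokens.length; i++) {
      // Key = word, value = document id; the reducer would then collect
      // the list of documents containing each word.
      ctx.write(new Text(tokens[i]), docId);
    }
  }
}
```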
16. Distributed databases vs. Hadoop
Computing model: distributed databases use the notion of transactions; a transaction is the unit of work, with ACID properties and concurrency control. Hadoop uses the notion of jobs; a job is the unit of work, with no concurrency control.
Data model: distributed databases handle structured data with a known schema, in read/write mode. In Hadoop, any data will fit in any format, (un)(semi)structured, in read-only mode.
Cost model: distributed databases run on expensive servers; Hadoop runs on cheap commodity machines.
Fault tolerance: in distributed databases failures are rare and handled by recovery mechanisms; in Hadoop failures are common over thousands of machines, handled by simple yet efficient fault tolerance.
Key characteristics: distributed databases emphasize efficiency, optimizations, and fine-tuning; Hadoop emphasizes scalability, flexibility, and fault tolerance.
19. Hadoop is being used for all kinds of tasks beyond its original design. Two pain points motivated YARN:
Tight coupling of a specific programming model with the resource-management infrastructure
Centralized handling of jobs' control flow
20. Scalability
Multi-tenancy
Serviceability
Locality Awareness
High Cluster Utilization
◦ HoD (Hadoop on Demand) does not resize the cluster between stages
◦ Users allocate more nodes than needed
Competing for resources results in longer latency to
start a job
21. Scalability
Multi-tenancy
Serviceability
Locality Awareness
High Cluster Utilization
Reliability/Availability
Secure and auditable operation
Support for Programming Model Diversity
Flexible Resource Model
◦ Hadoop: the number of map/reduce slots is fixed
◦ Easy, but lowers utilization
22. Separating resource-management functions from the programming model
MapReduce becomes just one of the applications (alongside Dryad, etc.)
Binary compatible / source compatible with existing MapReduce code
23. [Figure: YARN system diagram]
24. The Application Master (AM): the head of a job
Runs as a container
Requests resources from the RM (ResourceManager)
◦ number of containers / resources per container / locality …
Dynamically changes its resource consumption
Can run any user code (Dryad, MapReduce, Tez, REEF, etc.)
Requests are "late-binding"
25. Optimizes for locality among map tasks with identical resource requirements
◦ Selects a task with input data close to the container (see the request sketch below)
The AM determines the semantics of the success or failure of the container
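In client code, locality preferences are expressed on the container request itself. A minimal sketch using the public AMRMClient API shipped with YARN; the node and rack names are placeholders:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class LocalityRequest {
  static ContainerRequest mapTaskRequest() {
    Resource capability = Resource.newInstance(1024 /* MB */, 1 /* vcores */);
    // Prefer the hosts holding the input split's replicas, then their racks.
    String[] nodes = {"datanode07.example.com"};   // placeholder host
    String[] racks = {"/rack3"};                   // placeholder rack
    Priority priority = Priority.newInstance(10);
    // relaxLocality=true lets the RM fall back to other nodes if needed.
    return new ContainerRequest(capability, nodes, racks, priority, true);
  }
}
```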
26. 1. Submit the application by passing a Container Launch Context (CLC) for the Application Master to the RM.
2. When the RM starts the AM, the AM registers with the RM and periodically advertises its liveness and requirements over the heartbeat protocol.
3. Once the RM allocates a container, the AM can construct a CLC to launch the container on the corresponding NodeManager (NM). It may also monitor the status of the running container and stop it when the resource should be reclaimed. Monitoring the progress of work done inside the container is strictly the AM's responsibility.
4. Once the AM is done with its work, it should unregister from the RM and exit cleanly.
5. Optionally, framework authors may add control flow between their own clients to report job status and expose a control plane.
(A code sketch of steps 2–4 follows below.)
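A minimal sketch of steps 2–4 from the AM's point of view, using the synchronous AMRMClient that ships with YARN. Host, port, and message values are placeholders; a real AM would also launch containers via NMClient and handle failures and retries:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class MiniAppMaster {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new Configuration());
    rmClient.start();

    // Step 2: register with the RM; subsequent allocate() calls double
    // as the liveness heartbeat.
    rmClient.registerApplicationMaster("appmaster-host", 0, ""); // placeholders

    // Step 3: request one container and wait for the RM to grant it.
    Resource capability = Resource.newInstance(1024 /* MB */, 1 /* vcores */);
    rmClient.addContainerRequest(
        new ContainerRequest(capability, null, null, Priority.newInstance(0)));

    boolean granted = false;
    while (!granted) {
      AllocateResponse resp = rmClient.allocate(0.5f); // heartbeat + progress
      for (Container container : resp.getAllocatedContainers()) {
        granted = true;
        // A real AM would now build a CLC for `container` and hand it to an
        // NMClient to launch the task; it would also watch the container's
        // status and stop it to reclaim the resource.
      }
      Thread.sleep(1000);
    }

    // Step 4: unregister and exit cleanly.
    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
    rmClient.stop();
  }
}
```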
27. RM failure
Recover using persistent storage
Kill all containers, including the AMs'
Relaunch the AMs
NM failure
The RM detects it, marks the containers as killed, and reports to the AMs
AM failure
The RM kills the container and restarts it
Container failure
The framework is responsible for recovery
28. In a 2,500-node cluster, throughput improved from roughly 77,000 jobs/day to 150,000 jobs/day after moving to YARN.
29. Pig, Hive, Oozie
◦ Decompose a DAG job into multiple MapReduce jobs
Apache Tez
◦ DAG execution framework
Spark
Dryad
Giraph
◦ Vertex-centric graph computation framework; fits naturally within the YARN model
Storm
◦ Distributed real-time processing engine (parallel stream processing)
REEF
◦ Simplifies implementing an ApplicationMaster
Hoya
◦ HBase clusters on YARN
Editor's Notes
CC BY 3.0 http://creativecommons.org/licenses/by/3.0/deed.en_US
Figures are copied from the original paper and therefore owned by ACM
Add a figure to show the system diagram
High-level frameworks compose a workflow as a DAG of MapReduce jobs. The size of nodes in each stage may differ.
The use of an AM satisfies scalability, programming-model flexibility, and improved upgrading/testing.
Late-binding: the received lease may not be the same as the request. AM must accommodate the difference.
When the AM receives a container, it matches it against the set of pending map tasks, selecting a task with input data close to the container.
If the AM decides to run a map task mi in the container, then the other hosts storing replicas of mi's input data become less desirable; the AM will update its request to diminish the weight on those other k-1 hosts.
Even a simple AM can be fairly complex. Frameworks to ease development of YARN applications exist. We explore some of these in section 4.2. Client libraries - YarnClient, NMClient, AMRMClient - ship with YARN and expose higher-level APIs to avoid coding against low-level protocols.
Work is in progress to add sufficient protocol support for AMs to survive RM restart
Essentially, after moving to YARN, CPU utilization almost doubled.
One of the most important architectural differences that partially explains these improvements is the removal of the static split between map and reduce slots.