MapReduce and Hadoop
- Salil Navgire
Big Data Explosion
• 90% of today's data was created in the last 2 years

• Data volume doubles roughly every 18
months (a Moore's-law-like growth rate)
• YouTube: 13 million hours of video uploaded and
700 billion playbacks in 2010
• Facebook: 20TB/day (compressed)
• CERN/LHC: 40TB/day (15PB/year)

• Many more examples
Solution: Scalability
How?
Divide and Conquer
Challenges!
• How to assign units of work to the workers?

• What if there are more units of work than workers?
• What if the workers need to share intermediate
incomplete data?

• How do we aggregate such intermediate data?
• How do we know when all workers have completed
their assignments?

• What if some workers fail?
History
• 2000: Apache Lucene: batch index updates and
sort/merge with an on-disk index
• 2002: Apache Nutch: distributed, scalable open
source web crawler
• 2004: Google publishes GFS and MapReduce
papers
• 2006: Apache Hadoop: open-source Java
implementation of GFS and MapReduce to solve
Nutch's scaling problem; later becomes a standalone project
What is MapReduce?
• A programming model for distributing a task
across multiple nodes
• Used to build solutions that process large
amounts of data in parallel on clusters
of computing nodes

• Original MapReduce paper by Google
• Features of MapReduce:
• Fault-tolerance
• Status and monitoring tools
• A clean abstraction for programmers
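
Conceptually, the programmer supplies only two functions; the framework handles input splitting, shuffling and sorting by key, scheduling, and retries. Their shapes, as given in the original Google paper:

    map    (k1, v1)        ->  list(k2, v2)
    reduce (k2, list(v2))  ->  list(v2)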
MapReduce Execution Overview
[Figure: MapReduce execution overview, after the Google paper. The user program forks a master and a pool of workers. The master assigns map tasks over the input splits (Split 0, 1, 2) and reduce tasks to workers. Map workers read their splits and write intermediate pairs to local disk; reduce workers remote-read and sort that data, then write the final Output File 0 and Output File 1.]
Hadoop Components

• Storage: HDFS (self-healing, high-bandwidth clustered storage)
• Processing: MapReduce (fault-tolerant distributed processing)
HDFS Architecture
HDFS Basics
• HDFS is a filesystem written in Java
• Sits on top of a native filesystem
• Provides redundant storage for massive
amounts of data

• Runs on commodity hardware
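
Day-to-day interaction is through a POSIX-like shell on top of the Java filesystem API. A few representative commands (the /logs paths are illustrative):

    hadoop fs -put access.log /logs/access.log    # copy a local file into HDFS
    hadoop fs -ls /logs                           # list a directory
    hadoop fs -cat /logs/access.log               # stream a file's contents
    hadoop fs -get /logs/access.log copy.log      # copy back out to local disk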
HDFS Data
• Data is split into blocks and stored on
multiple nodes in the cluster
• Each block is usually 64 MB or 128 MB
• Each block is replicated multiple times (three by default)
• Replicas are stored on different DataNodes
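
Both knobs are configurable per cluster (and replication per file). A minimal hdfs-site.xml sketch, using the Hadoop 2.x property names (older releases called the first one dfs.block.size):

    <configuration>
      <property>
        <name>dfs.blocksize</name>
        <value>134217728</value>   <!-- 128 MB, in bytes -->
      </property>
      <property>
        <name>dfs.replication</name>
        <value>3</value>           <!-- copies of each block -->
      </property>
    </configuration>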
2 Types of Nodes
Master Nodes
Slave Nodes
Master Node
• NameNode
• only 1 per cluster
• metadata server and database
• SecondaryNameNode helps with some housekeeping

• JobTracker
• only 1 per cluster
• job scheduler
Slave Nodes
• DataNodes
• 1-4000 per cluster
• block data storage

• TaskTrackers
• 1-4000 per cluster
• task execution
NameNode
• A single NameNode stores all filesystem
metadata and mediates block replication and
read/write access to files
• Filenames, locations on DataNodes of each
block, owner, group, etc.
• All information maintained in RAM for fast
lookup
Secondary NameNode
• Performs memory-intensive administrative
functions for the NameNode, such as checkpointing its metadata
• Should run on a separate machine
Data Node
• DataNodes store file contents
• Different blocks of the same file will be
stored on different DataNodes
• The same block is stored on three (or more)
DataNodes for redundancy
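
To see where a file's blocks actually live, you can ask the NameNode through the Java client API. A minimal sketch (the path is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/logs/access.log"));
        // Ask the NameNode which DataNodes hold each block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
          System.out.println("offset " + b.getOffset() + ": "
              + String.join(", ", b.getHosts()));
        }
        fs.close();
      }
    }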
Word Count Example
• Input
• Text files

• Output
• Single file containing (Word <TAB> Count)

• Map Phase
• Generates (Word, Count) pairs for each input split (a runnable Java sketch follows this slide)
• [{a,1}, {b,1}, {a,1}] [{a,2}, {b,3}, {c,5}] [{a,3}, {b,1}, {c,1}]

• Reduce Phase
• For each word, calculates aggregate
• [{a,7}, {b,5}, {c,6}]
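
Putting the two phases together: below is a minimal, runnable Hadoop job, closely modeled on the WordCount example that ships with Hadoop (new org.apache.hadoop.mapreduce API; input and output paths come from the command line):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every token in the input line
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce phase: sum the counts for each word
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) sum += val.get();
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation per mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Note the combiner: it runs the reducer locally on each mapper's output before the shuffle, which is why the intermediate lists on this slide can appear pre-aggregated.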
Typical Cluster
• 3-4000 commodity servers

• Each server
• 2x quad-core
• 16-24 GB RAM

• 4-12 TB disk space

• 20-30 servers per rack
When Should I Use It?
Good choice for jobs that can be broken into independent, parallelizable tasks:

• Indexing/Analysis of log files
• Sorting of large data sets
• Image Processing/Machine Learning

Bad choice for serial or low-latency jobs:
• Real-time processing
• Processing-intensive tasks with little data
• Replacing MySQL
Who uses Hadoop?
