Hadoop & MapReduce
          Dr. Ioannis Konstantinou
      http://www.cslab.ntua.gr/~ikons


           AWS Usergroup Greece
               18/07/2012


        Computing Systems Laboratory
 School of Electrical and Computer Engineering
    National Technical University of Athens
Big Data
• 90% of today's data was created in the last 2 years
• A "Moore's law" for data: volume doubles every 18 months
• YouTube: 13 million hours of video and 700 billion views in 2010
• Facebook: 20 TB/day of new data (compressed)
• CERN/LHC: 40 TB/day (15 PB/year)
• Many more examples: web logs, presentation files, medical files, etc.
Problem: Data explosion
• 1 EB (exabyte = 10^18 bytes) = 1000 PB (petabyte = 10^15 bytes):
  the data traffic of mobile telephony in the USA in 2010
• 1.2 ZB (zettabyte = 10^21 bytes) = 1200 EB:
  the total volume of digital data in 2010
• 35 ZB: the estimated volume of total digital data in 2020
Solution: scalability
How? By dividing the work across many machines.
[Image: the IBM Roadrunner supercomputer. Source: Wikipedia]
Divide and Conquer
[Diagram: a "Problem" is partitioned into work units w1, w2, w3; each unit is
processed by a "worker", producing partial results r1, r2, r3, which are
combined into the final "Result".]
Parallelization challenges
• How do we assign units of work to the workers?
• What if there are more units of work than workers?
• What if the workers need to share intermediate, incomplete data?
• How do we aggregate such intermediate data?
• How do we know when all workers have completed their assignments?
• What if some workers fail?
What is MapReduce?
• A programming model
• A programming framework
• Used to develop solutions that:
  – process large amounts of data in parallel
  – run on clusters of computing nodes
• Originally a closed-source implementation at Google
  – the '03 & '04 scientific papers describe the framework
• Hadoop: an open-source implementation of the algorithms described in
  those papers
  – http://hadoop.apache.org/
What is Hadoop?
• 2 large subsystems, 1 for data management & 1 for computation:
  – HDFS (Hadoop Distributed File System)
  – the MapReduce computation framework, which runs on top of HDFS
  – HDFS is essentially the I/O layer of Hadoop
• Written in Java: a set of Java processes running on multiple nodes
• Who uses it:
  – Yahoo!
  – Amazon
  – Facebook
  – Twitter
  – plus many more...
HDFS – distributed file system
• A scalable distributed file system for applications dealing with large data sets
  – Distributed: runs on a cluster
  – Scalable: 10K nodes, 100K files, 10 PB of storage
• Storage is presented as a single, seamless space across the whole cluster
• Files are broken into blocks
  – Typical block size: 128 MB
• Replication: each block is copied to multiple data nodes
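As a back-of-the-envelope illustration of blocks and replication, here is a
short Python sketch. The 128 MB block size matches the slide; the replication
factor of 3 is an assumed, commonly used default:

    # Back-of-the-envelope: how a file is laid out in HDFS.
    # Assumes a 128 MB block size and replication factor 3 (common default).

    BLOCK_SIZE = 128 * 1024 * 1024   # bytes
    REPLICATION = 3

    def hdfs_layout(file_size_bytes):
        """Return (number of logical blocks, total physical copies stored)."""
        blocks = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
        return blocks, blocks * REPLICATION

    # A 1 GB file -> 8 logical blocks, 24 physical block copies in the cluster.
    print(hdfs_layout(1 * 1024**3))   # (8, 24)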
Architecture of HDFS/MapReduce
• Master/slave scheme
  – HDFS: a central NameNode administers multiple DataNodes
    – NameNode: holds metadata about which DataNodes hold which blocks
    – DataNodes: "dumb" servers that store raw file chunks
  – MapReduce: a central JobTracker administers multiple TaskTrackers
• NameNode and JobTracker run on the master node
• DataNode and TaskTracker run on the slave nodes
MapReduce
The problem is broken down into 2 phases:
• Map: non-overlapping subsets of the input data (<key, value> records) are
  assigned to different processes (mappers), each of which produces a set of
  intermediate <key, value> results
• Reduce: the output of the Map phase is fed to a (typically smaller) number
  of processes (reducers) that aggregate the intermediate results into a
  smaller set of <key, value> records
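To make the two phases concrete before the Hadoop specifics, here is a toy
single-process simulation in Python; the squaring and summing functions are
arbitrary examples, not part of MapReduce itself:

    from collections import defaultdict

    # Toy simulation of the two phases: map turns each input record into
    # intermediate (key, value) pairs, the "shuffle" groups them by key,
    # and reduce aggregates each group.

    def map_fn(record):
        key, value = record
        yield key, value * value        # example: square each value

    def reduce_fn(key, values):
        yield key, sum(values)          # example: sum the group

    inputs = [("a", 1), ("b", 2), ("a", 3)]

    groups = defaultdict(list)
    for record in inputs:               # map phase
        for k, v in map_fn(record):
            groups[k].append(v)         # shuffle: group pairs by key

    results = [out for k in sorted(groups)          # reduce phase
                   for out in reduce_fn(k, groups[k])]
    print(results)                      # [('a', 10), ('b', 4)]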
How does it work?
Initialization phase
• The input is uploaded to HDFS and split into pieces of fixed size
• Each TaskTracker node that participates in the computation executes a copy
  of the MapReduce program
• One of the nodes plays the JobTracker (master) role: it assigns tasks to
  the rest (the workers). Tasks are either map tasks or reduce tasks.
JobTracker (Master)
• The JobTracker holds data about:
  – the status of tasks
  – the location of input, output and intermediate data (it runs together
    with the NameNode, the HDFS master)
• The master is responsible for scheduling the execution of work tasks.
TaskTracker (Slave)
• The TaskTracker runs the tasks assigned to it by the master
• It runs on the same node as a DataNode (the HDFS slave)
• A task is either a Map task or a Reduce task
• Typically the maximum number of concurrent tasks a node may run equals the
  number of CPU cores it has (for optimal CPU utilization)
Map task
A worker (TaskTracker) that has been assigned a map task:
• Reads the relevant input data (its input split) from HDFS and parses it
  into <key, value> pairs, which are passed as input to the map function.
• The map function processes the pairs and produces intermediate pairs,
  which are buffered in memory.
• Periodically, a partition function is executed that spills the
  intermediate key-value pairs to the node's local storage, grouping them
  into R sets (one per reducer). This function is user-definable; a sketch
  of the default follows below.
• When the partition function has stored the key-value pairs, it informs the
  master that the task is complete and where the data are stored.
• The master forwards this information to the workers that run the reduce
  tasks.
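A minimal sketch of the default partitioning rule, hashing each key into one
of the R sets. Hadoop's real implementation is the Java HashPartitioner
class, so this Python version is only the idea:

    # Hash each intermediate key into one of R buckets, so every occurrence
    # of the same key lands on the same reducer. (Python salts str hashes
    # per process; within one run this stays consistent, which is all a
    # single job needs.)

    def partition(key, num_reducers):
        return (hash(key) & 0x7FFFFFFF) % num_reducers

    R = 2
    for key in ["w1", "w2", "w3", "w4"]:
        print(key, "-> set", partition(key, R))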
Reduce task
A worker that has been assigned a reduce task:
• Fetches from every completed map task the intermediate pairs that belong
  to it, at the locations indicated by the master.
• Once all intermediate pairs have been retrieved, sorts them by key;
  entries with the same key are grouped together. (A sketch of this
  sort-and-group step follows below.)
• Executes the reduce function with the pairs <key, group_of_values>
  produced by the previous step as input.
• The reduce function processes the input data and produces the final pairs.
• The output pairs are appended to a file in the local file system; when the
  reduce task completes, the file is made available in the distributed file
  system.
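The sort-and-group step is easy to picture with Python's itertools.groupby.
This is a local illustration only; a real reducer performs an external merge
sort over the fetched files:

    from itertools import groupby
    from operator import itemgetter

    # Sort the fetched intermediate pairs by key, then hand each key with
    # its grouped values to the reduce function.

    fetched = [("w2", 3), ("w1", 2), ("w2", 4), ("w1", 3)]

    fetched.sort(key=itemgetter(0))                 # sort by key
    for key, group in groupby(fetched, key=itemgetter(0)):
        values = [v for _, v in group]              # group_of_values
        print(key, sum(values))                     # reduce: here, a sum
    # w1 5
    # w2 7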
Task Completion
• When a worker has completed its task, it informs the master.
• When all workers have reported completion, the master returns control to
  the user's original program.
Example
[Diagram: the Master coordinates the job. The Input is split into Part 1,
Part 2 and Part 3; each part is processed by a Map worker. The intermediate
data are shuffled to Reduce workers, which together produce the Output.]
Example: Word count 1/3
• Objective: measure the frequency of each word in a large set of documents
• Potential use case: discovering popular URLs in a set of web-server log files
• Implementation plan:
  – "Upload" the documents to HDFS
  – Write a map function
  – Write a reduce function
  – Run the MapReduce job
  – Retrieve the results
Example: Word count 2/3
map(key, value):
// key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
// key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
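For a runnable version, here is the same logic as a pair of Hadoop
Streaming-style scripts in one file: the mapper reads lines from stdin and
emits tab-separated word/count pairs, and the reducer sums counts for
consecutive identical keys (Streaming delivers the reducer's input already
sorted by key). The file name wordcount.py is just an illustration:

    #!/usr/bin/env python
    # wordcount.py -- word count in the Hadoop Streaming style.
    # Try it locally with a pipe standing in for the shuffle:
    #   cat docs.txt | python wordcount.py map | sort | python wordcount.py reduce
    import sys

    def mapper(stream):
        for line in stream:
            for word in line.split():
                print("%s\t%d" % (word, 1))            # emit(w, 1)

    def reducer(stream):
        current, total = None, 0
        for line in stream:                            # input sorted by key
            word, count = line.rsplit("\t", 1)
            if word != current:
                if current is not None:
                    print("%s\t%d" % (current, total)) # emit(key, result)
                current, total = word, 0
            total += int(count)
        if current is not None:
            print("%s\t%d" % (current, total))

    if __name__ == "__main__":
        mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)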
Example: Word count 3/3
[Diagram: ten input documents d1–d10 are split among M=3 mappers. Each
mapper emits partial per-word counts such as (w1,2), (w2,3), (w3,2), (w4,3).
The shuffle routes the pairs for w1 and w2 to one of R=2 reducers and those
for w3 and w4 to the other; each reducer sums its partial counts into the
final total for each word.]
Extra features
Locality
• Move computation near the data: the master tries to schedule a task on a
  worker that is as "near" as possible to its input data, reducing network
  bandwidth usage.
• How does the master know? The JobTracker runs alongside the NameNode,
  which holds the block-to-DataNode mapping for every file.
Task distribution
• The number of tasks is usually higher than the number of available workers
• One worker can execute more than one task
• This improves load balancing, and if a single worker fails, its tasks can
  be recovered and redistributed to other nodes faster.
Redundant task executions
• Some tasks may straggle, delaying the overall job execution
• The solution is to launch copies of such tasks to be executed in parallel
  by 2 or more different workers (speculative execution)
• A task is considered complete as soon as the master is informed of its
  completion by at least one node.
Partitioning
• A user can specify a custom function to partition the intermediate keys
  during shuffling; an example follows below.
• The types of the input and output data are user-defined and are not
  restricted to any particular form.
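As an illustration of why one might supply a custom partition function, here
is a hypothetical range partitioner sketched in Python: keys are routed by
their first letter rather than by hash, so concatenating the reducers'
(individually sorted) outputs yields one globally sorted result. In real
Hadoop this would be a subclass of the Java Partitioner class:

    # Hypothetical range partitioner: spread keys over reducers by their
    # first letter, so reducer outputs can be concatenated into one
    # globally sorted file.

    def range_partition(key, num_reducers):
        first = key[0].lower()
        bucket = (ord(first) - ord("a")) * num_reducers // 26
        return max(0, min(bucket, num_reducers - 1))   # clamp odd keys

    for key in ["apple", "melon", "night", "zebra"]:
        print(key, "-> reducer", range_partition(key, 2))
    # apple -> 0, melon -> 0, night -> 1, zebra -> 1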
• The input of a reducer is always sorted by key
• Tasks can also be executed locally, in a serial manner (useful for debugging)
• The master provides web interfaces for:
  – monitoring task progress
  – browsing HDFS
When should I use it?
• Good choice for jobs that can be broken into parallel sub-tasks:
  – indexing/analysis of log files
  – sorting of large data sets
  – image processing
• Bad choice for serial or low-latency jobs:
  – computing the number π to a precision of 1,000,000 digits
  – computing the Fibonacci sequence
  – replacing MySQL
Use cases 1/3
• The New York Times: large-scale image conversions
  – 100 Amazon EC2 instances, 4 TB of raw TIFF data
  – 11 million PDFs in 24 hours, for $240
• Facebook: internal log processing
  – reporting, analytics and machine learning
  – cluster of 1110 machines, 8800 cores and 12 PB of raw storage
  – open-source contributors (Hive)
• Twitter: storing and processing tweets, logs, etc.
  – open-source contributors (hadoop-lzo)
  – large-scale machine learning
Use cases 2/3
• Yahoo!: 100,000 CPUs in 25,000 computers
  – content/ads optimization, search index
  – machine learning (e.g. spam filtering)
  – open-source contributors (Pig)
• Microsoft: natural-language search (through Powerset)
  – 400 nodes in EC2, storage in S3
  – open-source contributors (!) to HBase
• Amazon: the ElasticMapReduce service
  – on-demand elastic Hadoop clusters for the cloud
Use cases 3/3
• LinkedIn: ETL processing, statistics generation
  – advanced algorithms for behavioral analysis and targeting
  – used for discovering "People You May Know" and for other apps
  – 3 x 30-node clusters, 16 GB RAM and 8 TB storage
• Baidu: leading Chinese-language search engine
  – search log analysis, data mining
  – 300 TB per week
  – 10- to 500-node clusters
Amazon ElasticMapReduce (EMR)
• A hosted Hadoop-as-a-service solution provided by AWS
• No need to manage or tune Hadoop clusters yourself:
  – upload your input data to S3 and store your output data there too
  – procure as many EC2 instances as you need and pay only for the time you
    use them
• Hive and Pig support makes it easy to write data-analysis scripts
• Java, Perl, Python, PHP, C++ for more sophisticated algorithms
• Integrates with DynamoDB (process combined datasets in S3 & DynamoDB)
• Support for HBase (NoSQL)
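As a sketch of how a job was launched programmatically in that era, here is
a hedged example using the boto 2.x Python library; the bucket names and
scripts are placeholders, and the exact keyword arguments should be checked
against the boto documentation rather than taken as authoritative:

    # Launch a streaming word-count job flow on EMR with boto 2.x.
    # All S3 paths below are placeholders.

    from boto.emr import connect_to_region
    from boto.emr.step import StreamingStep

    conn = connect_to_region("us-east-1")

    step = StreamingStep(
        name="wordcount",
        mapper="s3n://my-bucket/wordcount-mapper.py",   # hypothetical scripts
        reducer="s3n://my-bucket/wordcount-reducer.py",
        input="s3n://my-bucket/input/",
        output="s3n://my-bucket/output/",
    )

    jobflow_id = conn.run_jobflow(
        name="wordcount demo",
        log_uri="s3n://my-bucket/logs/",
        steps=[step],
        num_instances=4,                                # 1 master + 3 slaves
        master_instance_type="m1.small",
        slave_instance_type="m1.small",
    )
    print("Started job flow:", jobflow_id)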
Questions
