MapReduce and the New
Software Stack
Maruf Aytekin
PhD Student
BAU Computer Engineering Department
Besiktas/Istanbul
January 5, 2015
Outline
• Introduction
• DFS
• MapReduce
• Examples
• Matrix Calculation on Hadoop
Introduction
Modern data-mining and machine-learning applications, often called “big-data analysis,” require us to manage massive amounts of data quickly.
Important Examples
• The ranking of Web pages by importance,
which involves an iterated matrix-vector
multiplication where the dimension is
many billions.
• Searches in social-networking sites, which
involve graphs with hundreds of millions
of nodes and many billions of edges.
• Processing large amounts of text or streams, such as news recommendation.
New software stack
• Not a “supercomputer” (Beowulf etc.)
• “computing clusters” – large collections of
commodity hardware, including conventional
processors (“compute nodes”) connected by
Ethernet cables or inexpensive switches.
Distributed File System
• A new form of file system that features
much larger units than the disk blocks in a
conventional operating system.
• Files can be enormous, possibly terabytes
in size.
• Files are rarely updated.
Physical Organization
• Files are divided into chunks (typically tens of megabytes, e.g. 64 MB)
• Chunks are replicated (typically three copies) on different compute nodes
DFS Implementations
• The Google File System (GFS)
• Hadoop Distributed File System (HDFS)
• CloudStore, by Kosmix
HDFS Architecture
Block Replication
MapReduce
A style of computing: a framework/pattern for processing large amounts of data in parallel on a cluster.
Implementations:
• MapReduce by Google (internal)
• Hadoop by the Apache Software Foundation.
MapReduce
Operates exclusively on <key, value> pairs.
(input) <k1, v1>
-> map -> <k2, v2>
-> combine -> <k2, v2>
-> reduce -> <k3, v3> (output)
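To make this typing concrete, here is a minimal Java sketch (an illustration, not from the slides) of how the type parameters appear in Hadoop's mapreduce API; the pass-through classes below simply forward what they receive.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<K1, V1, K2, V2>: consumes <k1, v1> pairs and emits <k2, v2> pairs.
class PassThroughMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  protected void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(key, value);            // emit a <k2, v2> pair
  }
}

// Reducer<K2, V2, K3, V3>: consumes <k2, [v2, v2, ...]> and emits <k3, v3> pairs.
class PassThroughReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      context.write(key, value);          // emit a <k3, v3> pair
    }
  }
}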
MapReduce Computation
MapReduce
In brief, a MapReduce computation executes as follows:
• Chunks from a DFS are given to Map tasks.
• These Map tasks turn the chunks into a sequence of
<key, value> pairs.
• The <key,value> pairs from each Map task are
collected by a master controller and sorted by key.
(Combine)
• The keys are divided among all the Reduce tasks, so
all <key,value> pairs with the same key wind up at
the same Reduce task.
• The Reduce tasks work on one key at a time,
process all the values for that key, and output the
results as <key, value> pairs (a job-configuration sketch follows).
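As a rough illustration (not from the slides) of where each of these steps is configured, here is a minimal Hadoop job driver in Java. It uses the framework's built-in identity Mapper and Reducer, so it merely copies its input; a real job substitutes its own classes at the set*Class calls.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PipelineSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "pipeline sketch");
    job.setJarByClass(PipelineSketch.class);
    job.setMapperClass(Mapper.class);      // Map tasks: turn DFS chunks into <key, value> pairs
    job.setCombinerClass(Reducer.class);   // optional Combine: local aggregation of map output
    job.setNumReduceTasks(4);              // keys are partitioned among four Reduce tasks
    job.setReducerClass(Reducer.class);    // Reduce: one call per key, over all values for that key
    job.setOutputKeyClass(LongWritable.class);   // default TextInputFormat keys (byte offsets)
    job.setOutputValueClass(Text.class);         // the input lines themselves
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input chunks read from the DFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // results written back to the DFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}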
Execution of MapReduce
Hello World
Word Count
• file01:
Hello World Bye World
• file02:
Hello Hadoop Goodbye Hadoop
Word Count
For the given sample input
the first map emits:
< Hello, 1 >
< World, 1 >
< Bye, 1 >
< World, 1 >
The second map emits:
< Hello, 1 >
< Hadoop, 1 >
< Goodbye, 1 >
< Hadoop, 1 >
Combiner:
After the output of each map is sorted on keys and locally aggregated by the combiner:
The output of the first map:
< Bye, 1 >
< Hello, 1 >
< World, 2 >
The output of the second map:
< Goodbye, 1 >
< Hadoop, 2 >
< Hello, 1 >
Word Count
Thus the output of the job is:
< Bye, 1 >
< Goodbye, 1 >
< Hadoop, 2 >
< Hello, 2 >
< World, 2 >
The Reducer implementation, via its reduce method, simply sums up
the values, which are the occurrence counts for each key (see the sketch below).
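For reference, the standard Hadoop WordCount program described here looks roughly as follows (adapted from the Apache Hadoop MapReduce tutorial; details may differ slightly between Hadoop versions). Note that the same IntSumReducer class serves as both the combiner and the reducer.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for every word in a line, emit <word, 1>
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce (also used as the combiner): sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // produces the local sums, e.g. < World, 2 > from file01
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}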
Matrix Calculation
P = M N, where M is an i × j matrix and N is a j × k matrix, so P is an i × k matrix.
Matrix Data Model for MapReduce:
M (i, j, mij)
N (j, k, njk)
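Each element of the product P is the usual row-by-column sum over the shared index j, which is what the Reduce step must ultimately compute; in LaTeX notation:

p_{ik} = \sum_{j} m_{ij} \, n_{jk}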
Matrix Data Files for MapReduce
M,0,0,10.0
M,0,2,9.0
M,0,3,9.0
M,1,0,1.0
M,1,1,3.0
M,1,2,18.0
M,1,3,25.2
.
.
.
M, i, j, mij
N,0,0,1.0
N,0,2,3.0
N,0,4,2.0
N,1,0,2.0
N,3,2,-1.0
N,3,6,4.0
N,4,6,5.0
.
.
.
N, j, k, njk
Map
Reduce
Example
Map Task
Matrix M: key-value pairs are produced from each element mij.
Matrix N: key-value pairs are produced from each element njk.
Map Task Output
Reduce Task
P = M N (the resulting product matrix)
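The exact key-value scheme used in the slides appears only in figures not reproduced here. As a stand-in, below is a minimal Java sketch of one common single-pass formulation, assuming the M,i,j,mij and N,j,k,njk line format shown earlier; the class names and the rowsM/colsN configuration keys are illustrative, not taken from the slides.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiply {

  // Mapper: for m_ij emit <(i,k), (M, j, m_ij)> for every column k of N;
  //         for n_jk emit <(i,k), (N, j, n_jk)> for every row i of M.
  public static class MatrixMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      Configuration conf = context.getConfiguration();
      int rowsM = conf.getInt("rowsM", 0);   // number of rows of M (range of i)
      int colsN = conf.getInt("colsN", 0);   // number of columns of N (range of k)
      String[] t = value.toString().split(",");
      if (t[0].equals("M")) {                // a line of M: M,i,j,value
        int i = Integer.parseInt(t[1]);
        int j = Integer.parseInt(t[2]);
        for (int k = 0; k < colsN; k++) {
          context.write(new Text(i + "," + k), new Text("M," + j + "," + t[3]));
        }
      } else {                               // a line of N: N,j,k,value
        int j = Integer.parseInt(t[1]);
        int k = Integer.parseInt(t[2]);
        for (int i = 0; i < rowsM; i++) {
          context.write(new Text(i + "," + k), new Text("N," + j + "," + t[3]));
        }
      }
    }
  }

  // Reducer: for key (i,k), pair up the values that share the same j and sum m_ij * n_jk.
  public static class MatrixReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      Map<Integer, Double> mRow = new HashMap<>();
      Map<Integer, Double> nCol = new HashMap<>();
      for (Text v : values) {
        String[] t = v.toString().split(",");
        int j = Integer.parseInt(t[1]);
        double x = Double.parseDouble(t[2]);
        if (t[0].equals("M")) mRow.put(j, x); else nCol.put(j, x);
      }
      double sum = 0.0;
      for (Map.Entry<Integer, Double> e : mRow.entrySet()) {
        Double n = nCol.get(e.getKey());
        if (n != null) sum += e.getValue() * n;
      }
      context.write(key, new Text(Double.toString(sum)));   // the element P(i,k)
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("rowsM", Integer.parseInt(args[2]));   // matrix dimensions passed on the command line
    conf.setInt("colsN", Integer.parseInt(args[3]));
    Job job = Job.getInstance(conf, "matrix multiply");
    job.setJarByClass(MatrixMultiply.class);
    job.setMapperClass(MatrixMapper.class);
    job.setReducerClass(MatrixReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}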
Application
• Run the application on Hadoop
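A hypothetical way to run the matrix sketch above, assuming it has been packaged as matrixmultiply.jar and the M/N data files have been copied into an HDFS directory named input (the final two arguments are the assumed number of rows of M and columns of N):

hadoop jar matrixmultiply.jar MatrixMultiply input output 2 7
hdfs dfs -cat output/part-r-00000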
Thank you!
Q & A
