MapReduce and the New
Software Stack
Maruf Aytekin
PhD Student
BAU Computer Engineering Department
Besiktas/Istanbul
January 5, 2015
Outline
• Introduction
• DFS
• MapReduce
• Examples
• Matrix Calculation on Hadoop
Introduction
Modern data-mining and machine-learning applications, often called “big-data analysis,” require us to manage massive amounts of data quickly.
Important Examples
• The ranking of Web pages by importance,
which involves an iterated matrix-vector
multiplication where the dimension is
many billions.
• Searches in social-networking sites, which
involve graphs with hundreds of millions
of nodes and many billions of edges.
• Processing large amounts of text or streams, such as news recommendation.
New software stack
• Not a “supercomputer” (Beowulf etc.)
• “computing clusters” – large collections of
commodity hardware, including conventional
processors (“compute nodes”) connected by
Ethernet cables or inexpensive switches.
Distributed File System
• A new form of file system that features
much larger units than the disk blocks in a
conventional operating system.
• Files can be enormous, possibly terabytes
in size.
• Files are rarely updated.
Physical Organization
• Files are divided into chunks (typically tens of megabytes, e.g. 64 MB)
• Chunks are replicated (typically three copies) on different compute nodes
DFS Implementations
• The Google File System (GFS)
• Hadoop Distributed File System (HDFS)
• CloudStore, by Kosmix
HDFS Architecture
Block Replication
MapReduce
A style of computing: a framework/pattern for processing large amounts of data in parallel on a cluster.
Implementations:
• MapReduce by Google (internal)
• Hadoop by the Apache Software Foundation.
MapReduce
Operates exclusively on <key, value> pairs.
(input) <k1, v1>
-> map -> <k2, v2>
-> combine -> <k2, v2>
-> reduce -> <k3, v3> (output)
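To make this typing concrete, here is a minimal Java sketch (an illustration, not from the slides) of how the type parameters appear in Hadoop's mapreduce API; the pass-through classes below simply forward what they receive.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<K1, V1, K2, V2>: consumes <k1, v1> pairs and emits <k2, v2> pairs.
class PassThroughMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  protected void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(key, value);            // emit a <k2, v2> pair
  }
}

// Reducer<K2, V2, K3, V3>: consumes <k2, [v2, v2, ...]> and emits <k3, v3> pairs.
class PassThroughReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      context.write(key, value);          // emit a <k3, v3> pair
    }
  }
}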
MapReduce Computation
MapReduce
In brief, a MapReduce computation executes as follows:
• Chunks from a DFS are given to Map tasks.
• These Map tasks turn the chunks into a sequence of
<key, value> pairs.
• The <key,value> pairs from each Map task are
collected by a master controller and sorted by key.
(Combine)
• The keys are divided among all the Reduce tasks, so
all <key,value> pairs with the same key wind up at
the same Reduce task.
• The Reduce tasks work on one key at a time,
process all the values for that key, and output the
results as <key, value> pairs (a job-configuration sketch follows).
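As a rough illustration (not from the slides) of where each of these steps is configured, here is a minimal Hadoop job driver in Java. It uses the framework's built-in identity Mapper and Reducer, so it merely copies its input; a real job substitutes its own classes at the set*Class calls.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PipelineSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "pipeline sketch");
    job.setJarByClass(PipelineSketch.class);
    job.setMapperClass(Mapper.class);      // Map tasks: turn DFS chunks into <key, value> pairs
    job.setCombinerClass(Reducer.class);   // optional Combine: local aggregation of map output
    job.setNumReduceTasks(4);              // keys are partitioned among four Reduce tasks
    job.setReducerClass(Reducer.class);    // Reduce: one call per key, over all values for that key
    job.setOutputKeyClass(LongWritable.class);   // default TextInputFormat keys (byte offsets)
    job.setOutputValueClass(Text.class);         // the input lines themselves
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input chunks read from the DFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // results written back to the DFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}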
Execution of MapReduce
Hello World
Word Count
• file01:
Hello World Bye World
• file02:
Hello Hadoop Goodbye Hadoop
Word Count
For the given sample input
the first map emits:
< Hello, 1 >
< World, 1 >
< Bye, 1 >
< World, 1 >
The second map emits:
< Hello, 1 >
< Hadoop, 1 >
< Goodbye, 1 >
< Hadoop, 1 >
Combiner:
After the output of each map is sorted on keys and locally aggregated by the combiner:
The output of the first map:
< Bye, 1 >
< Hello, 1 >
< World, 2 >
The output of the second map:
< Goodbye, 1 >
< Hadoop, 2 >
< Hello, 1 >
Word Count
Thus the output of the job is:
< Bye, 1 >
< Goodbye, 1 >
< Hadoop, 2 >
< Hello, 2 >
< World, 2 >
The Reducer implementation, via its reduce method, simply sums up
the values, which are the occurrence counts for each key (see the sketch below).
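For reference, the standard Hadoop WordCount program described here looks roughly as follows (adapted from the Apache Hadoop MapReduce tutorial; details may differ slightly between Hadoop versions). Note that the same IntSumReducer class serves as both the combiner and the reducer.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for every word in a line, emit <word, 1>
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce (also used as the combiner): sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // produces the local sums, e.g. < World, 2 > from file01
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}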
Matrix Calculation
P = M N, where M is an i × j matrix and N is a j × k matrix, so P is an i × k matrix.
Matrix Data Model for MapReduce:
M (i, j, mij)
N (j, k, njk)
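Each element of the product P is the usual row-by-column sum over the shared index j, which is what the Reduce step must ultimately compute; in LaTeX notation:

p_{ik} = \sum_{j} m_{ij} \, n_{jk}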
Matrix Data Files for MapReduce
M,0,0,10.0
M,0,2,9.0
M,0,3,9.0
M,1,0,1.0
M,1,1,3.0
M,1,2,18.0
M,1,3,25.2
.
.
.
M, i, j, mij
N,0,0,1.0
N,0,2,3.0
N,0,4,2.0
N,1,0,2.0
N,3,2,-1.0
N,3,6,4.0
N,4,6,5.0
.
.
.
N, j, k, njk
Map
Reduce
Example
Map Task
Matrix M: key-value pairs are produced from each element mij.
Matrix N: key-value pairs are produced from each element njk.
Map Task Output
Reduce Task
P = M N (the resulting product matrix)
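The exact key-value scheme used in the slides appears only in figures not reproduced here. As a stand-in, below is a minimal Java sketch of one common single-pass formulation, assuming the M,i,j,mij and N,j,k,njk line format shown earlier; the class names and the rowsM/colsN configuration keys are illustrative, not taken from the slides.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiply {

  // Mapper: for m_ij emit <(i,k), (M, j, m_ij)> for every column k of N;
  //         for n_jk emit <(i,k), (N, j, n_jk)> for every row i of M.
  public static class MatrixMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      Configuration conf = context.getConfiguration();
      int rowsM = conf.getInt("rowsM", 0);   // number of rows of M (range of i)
      int colsN = conf.getInt("colsN", 0);   // number of columns of N (range of k)
      String[] t = value.toString().split(",");
      if (t[0].equals("M")) {                // a line of M: M,i,j,value
        int i = Integer.parseInt(t[1]);
        int j = Integer.parseInt(t[2]);
        for (int k = 0; k < colsN; k++) {
          context.write(new Text(i + "," + k), new Text("M," + j + "," + t[3]));
        }
      } else {                               // a line of N: N,j,k,value
        int j = Integer.parseInt(t[1]);
        int k = Integer.parseInt(t[2]);
        for (int i = 0; i < rowsM; i++) {
          context.write(new Text(i + "," + k), new Text("N," + j + "," + t[3]));
        }
      }
    }
  }

  // Reducer: for key (i,k), pair up the values that share the same j and sum m_ij * n_jk.
  public static class MatrixReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      Map<Integer, Double> mRow = new HashMap<>();
      Map<Integer, Double> nCol = new HashMap<>();
      for (Text v : values) {
        String[] t = v.toString().split(",");
        int j = Integer.parseInt(t[1]);
        double x = Double.parseDouble(t[2]);
        if (t[0].equals("M")) mRow.put(j, x); else nCol.put(j, x);
      }
      double sum = 0.0;
      for (Map.Entry<Integer, Double> e : mRow.entrySet()) {
        Double n = nCol.get(e.getKey());
        if (n != null) sum += e.getValue() * n;
      }
      context.write(key, new Text(Double.toString(sum)));   // the element P(i,k)
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("rowsM", Integer.parseInt(args[2]));   // matrix dimensions passed on the command line
    conf.setInt("colsN", Integer.parseInt(args[3]));
    Job job = Job.getInstance(conf, "matrix multiply");
    job.setJarByClass(MatrixMultiply.class);
    job.setMapperClass(MatrixMapper.class);
    job.setReducerClass(MatrixReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}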
Application
• Run the application on Hadoop
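A hypothetical way to run the matrix sketch above, assuming it has been packaged as matrixmultiply.jar and the M/N data files have been copied into an HDFS directory named input (the final two arguments are the assumed number of rows of M and columns of N):

hadoop jar matrixmultiply.jar MatrixMultiply input output 2 7
hdfs dfs -cat output/part-r-00000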
Thank you!
Q & A
