MapReduce and the art of ā€œThinking Parallelā€
Shailesh Kumar
Third Leap, Inc.
Three I’s of a great product!
ā–Ŗ Interface: Intuitive | Functional | Elegant
ā–Ŗ Infrastructure: Storage | Computation | Network
ā–Ŗ Intelligence: Learn | Predict | Adapt | Evolve
Drowning in Data, Starving for Knowledge
ATATTAGGTTTTTACCTACCC
AGGAAAAGCCAACCAACCTC
GATCTCTTGTAGATCTGTTCT
CTAAACGAACTTTAAAATCTG
TGTAGCTGTCGCTCGGCTG
CATGCCTAGTGCACCTACGC
AGTATAAACAATAATAAATTTT
ACTGTCGTTGACAAGAAACG
AGTAACTCGTCCCTCTTCTG
CAGACTGCTTATTACGCGAC
CGTAAGCTAC…
How BIG is Big Data?
ā–Ŗ 600 million tweets per DAY
ā–Ŗ 100 hours per MINUTE
ā–Ŗ 800+ websites per MINUTE
ā–Ŗ 100 TB of data uploaded DAILY
ā–Ŗ 3.5 Billion queries PER DAY
ā–Ŗ 300 Million active customers
ā–Ŗ Better Sensors
ā–Ŗ Higher resolution, Real-time, Diverse measurements, …
ā–Ŗ Faster Communication
ā–Ŗ Network infrastructure, Compression Technologies, …
ā–Ŗ Cheaper Storage
ā–Ŗ Cloud-based storage, large warehouses, NoSQL databases
ā–Ŗ Massive Computation
ā–Ŗ Cloud computing, MapReduce/Hadoop parallel processing paradigms
ā–Ŗ Intelligent Decisions
ā–Ŗ Advances in Machine Learning and Artificial Intelligence
How did we get here?
The Evolution of ā€œComputingā€
Parallel Computing Basics
ā–Ŗ Data Parallelism (distributed computing)
ā–Ŗ Lots of data → break it into ā€œchunksā€,
ā–Ŗ Process each ā€œchunkā€ of data in parallel,
ā–Ŗ Combine results from each ā€œchunkā€
ā–Ŗ MAPREDUCE = Data Parallelism
ā–Ŗ Process Parallelism (data flow computing)
ā–Ŗ Lots of stages → set up a process graph
ā–Ŗ Pass data through all stages
ā–Ŗ All stages run in parallel on different data
ā–Ŗ Assembly line = Process Parallelism
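The data-parallel recipe above (chunk → process → combine) can be sketched in a few lines of Python. This is a toy illustration, not the MapReduce runtime itself; `process_chunk` is a stand-in for any per-chunk computation, and the thread pool is an arbitrary choice of executor.

```python
from multiprocessing.dummy import Pool  # thread pool; a process pool works the same way

def process_chunk(chunk):
    # Stand-in for any per-chunk computation.
    return sum(chunk)

def parallel_sum(data, n_chunks=4):
    # 1. Break the data into chunks.
    size = (len(data) + n_chunks - 1) // n_chunks
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # 2. Process each chunk in parallel.
    with Pool(n_chunks) as pool:
        partials = pool.map(process_chunk, chunks)
    # 3. Combine results from each chunk.
    return sum(partials)

print(parallel_sum(list(range(100))))  # 4950
```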
Agenda
MAPREDUCE Background
Problem 1 – Similarity between all pairs of documents!
Problem 2 – Parallelizing K-Means clustering
Problem 3 – Finding all Maximal Cliques in a Graph
MAPREDUCE 101: A 4-stage Process
[Pipeline diagram: lots of data split into Shards 1…N → Map 1…K (each Map processes N/K shards) → Combine 1…K → Shuffle 1…K → Reduce 1…R → Output 1…R]
MAPREDUCE 101: An example Task
ā–Ŗ Count the total frequency of all words on the web
ā–Ŗ Total number of documents > 20 Billion
ā–Ŗ Total number of unique words > 20 Million
ā–Ŗ Non-Parallel / Linear Implementation
for each document d on the Web:
    for each unique word w in d:
        DocCount(w, d) = # times w occurred in d
        WebCount(w) += DocCount(w, d)
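As a minimal sketch, the linear implementation above translates directly into Python (`web_count` and the two toy documents are invented for this example):

```python
from collections import Counter

def web_count(documents):
    # Linear (non-parallel) word count: one pass over every document.
    total = Counter()
    for d in documents:                   # for each document d on the Web
        doc_count = Counter(d.split())    # DocCount(w, d) for each unique w in d
        total.update(doc_count)           # WebCount(w) += DocCount(w, d)
    return total

print(web_count(["a b b", "b c"]))  # totals: b → 3, a → 1, c → 1
```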
MAPREDUCE – MAP/COMBINE

Shard1 (Map-1 → Combine-1):
  Input:  A 10 | B 7 | C 9 | D 3 | B 4
  Output: A 10 | B 11 | C 9 | D 3

Shard2 (Map-2 → Combine-2):
  Input:  A 3 | D 1 | C 4 | D 9 | B 6
  Output: A 3 | B 6 | C 4 | D 10

Shard3 (Map-3 → Combine-3):
  Input:  B 3 | D 5 | C 4 | A 6 | A 3
  Output: A 9 | B 3 | C 4 | D 5
MAPREDUCE – Shuffle/Reduce

Combiner outputs:
  Combine-1: A 10 | B 11 | C 9 | D 3
  Combine-2: A 3 | B 6 | C 4 | D 10
  Combine-3: A 9 | B 3 | C 4 | D 5

Shuffle (keys A, C routed to Reduce-1; keys B, D to Reduce-2):
  Reduce-1 input: A 10, A 3, A 9 | C 9, C 4, C 4
  Reduce-2 input: B 11, B 6, B 3 | D 3, D 10, D 5

Reduce (sum per key):
  Reduce-1: A 22 | C 17
  Reduce-2: B 20 | D 18
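The whole toy run above can be simulated in a few lines of Python. The shard data comes from the tables; hash partitioning into two reducers is an illustrative choice, so which reducer handles which key may differ from the slides, while the totals do not.

```python
from collections import defaultdict

shards = [
    [("A", 10), ("B", 7), ("C", 9), ("D", 3), ("B", 4)],
    [("A", 3), ("D", 1), ("C", 4), ("D", 9), ("B", 6)],
    [("B", 3), ("D", 5), ("C", 4), ("A", 6), ("A", 3)],
]

def combine(pairs):
    # COMBINE: pre-aggregate within one shard to cut shuffle traffic.
    out = defaultdict(int)
    for k, v in pairs:
        out[k] += v
    return dict(out)

def shuffle(combined_shards, n_reducers=2):
    # SHUFFLE: route every value of a key to the same reducer.
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for shard in combined_shards:
        for k, v in shard.items():
            buckets[hash(k) % n_reducers][k].append(v)
    return buckets

def reduce_bucket(bucket):
    # REDUCE: order-agnostic sum of all values of each key.
    return {k: sum(vs) for k, vs in bucket.items()}

totals = {}
for bucket in shuffle([combine(s) for s in shards]):
    totals.update(reduce_bucket(bucket))
print(totals)  # A=22, B=20, C=17, D=18
```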
Key Questions in MAPREDUCE
ā–Ŗ Is the task really ā€œdata-parallelizableā€?
ā–Ŗ High dependence tasks (e.g. Fibonacci series)
ā–Ŗ Recursive tasks (e.g. Binary Search)
ā–Ŗ What is the key-value pair output for MAP step?
ā–Ŗ Each map processes only one data record at a time
ā–Ŗ It can generate none, one, or multiple key-value pairs
ā–Ŗ How to combine values of a key in the REDUCE step?
ā–Ŗ The key for Reduce is the same as the key of the Map output
ā–Ŗ The reduce function must be ā€œorder agnosticā€
Other considerations
ā–Ŗ Reliability/Robustness
ā–Ŗ A processor or disk might go bad during the process
ā–Ŗ Optimization/Efficiency
ā–Ŗ Allocate CPUs near the data shards to reduce network overhead
ā–Ŗ Scale/Parallelism
ā–Ŗ Parallelization scales linearly with the number of machines
ā–Ŗ Simplicity/Usability
ā–Ŗ Just specify the Map task and the Reduce task and be done!
ā–Ŗ Generality
ā–Ŗ Lots of parallelizable tasks can be written in MapReduce
ā–Ŗ With some creativity, many more than you can imagine!
Agenda
MAPREDUCE Background
Problem 1 – Similarity between all pairs of documents!
Problem 2 – Parallelizing K-Means clustering
Problem 3 – Finding all Maximal Cliques in a Graph
Similarity between all pairs of docs.
ā–Ŗ Why bother?
ā–Ŗ Document Clustering, Similar document search, etc.
ā–Ŗ Document represented as a ā€œBag-of-Tokensā€
ā–Ŗ A weight associated with each token in the vocabulary
ā–Ŗ Most weights are zero – Sparsity
ā–Ŗ Cosine Similarity between two documents
  di = { w1^i, w2^i, ..., wT^i },   dj = { w1^j, w2^j, ..., wT^j }

  Sim(di, dj) = Ī£t=1..T  wt^i Ɨ wt^j
Non-Parallel / Linear Implementation

For each document di
    For each document dj (j > i)
        Sim(di, dj) = Ī£t=1..T wt^i Ɨ wt^j

Complexity = O(D² T σ)
  σ = Sparsity factor = 10^āˆ’5 = average fraction of the vocabulary per document
  D = O(10B), T = O(10M)
  Complexity = O(10^(20+7āˆ’5)) = O(10^22)
Toy Example for doc-doc similarity
A classic ā€œJoinā€

Documents = {W, X, Y, Z},  Words = {a, b, c, d, e}

Input (forward index):
  W → { (a,1), (b,2), (e,5) }
  X → { (a,3), (c,4), (d,5) }
  Y → { (b,6), (c,7), (d,8) }
  Z → { (a,9), (e,10) }

Output (all pairwise similarities):
  (W,X) → Sim(W,X) = 3
  (W,Y) → Sim(W,Y) = 12
  (W,Z) → Sim(W,Z) = 59
  (X,Y) → Sim(X,Y) = 68
  (X,Z) → Sim(X,Z) = 27
  (Y,Z) → Sim(Y,Z) = 0
Reverse Indexing to the rescue
First convert the data to a reverse index:

  Forward index:                   Reverse index:
  W → { (a,1), (b,2), (e,5) }     a → { (W,1), (X,3), (Z,9) }
  X → { (a,3), (c,4), (d,5) }     b → { (W,2), (Y,6) }
  Y → { (b,6), (c,7), (d,8) }     c → { (X,4), (Y,7) }
  Z → { (a,9), (e,10) }           d → { (X,5), (Y,8) }
                                  e → { (W,5), (Z,10) }
Key/Value for the MAP-Step
Each posting list emits one partial product per pair of documents sharing that word:

  a → { (W,1), (X,3), (Z,9) }   emits  (W,X) → 3,  (W,Z) → 9,  (X,Z) → 27
  b → { (W,2), (Y,6) }          emits  (W,Y) → 12
  c → { (X,4), (Y,7) }          emits  (X,Y) → 28
  d → { (X,5), (Y,8) }          emits  (X,Y) → 40
  e → { (W,5), (Z,10) }         emits  (W,Z) → 50

After shuffling by pair:
  (W,X) → 3
  (W,Y) → 12
  (W,Z) → 9, 50
  (X,Y) → 40, 28
  (X,Z) → 27
Value combining in REDUCE-Step
Reduce = sum the partial products of each pair:

  (W,X) → 3              →  (W,X) → Sim(W,X) = 3
  (W,Y) → 12             →  (W,Y) → Sim(W,Y) = 12
  (W,Z) → 9, 50          →  (W,Z) → Sim(W,Z) = 59
  (X,Y) → 40, 28         →  (X,Y) → Sim(X,Y) = 68
  (X,Z) → 27             →  (X,Z) → Sim(X,Z) = 27
  (Y,Z) → never emitted  →  Sim(Y,Z) = 0
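End to end, the reverse-index join above can be sketched as one small program. The toy corpus is taken from the slides; the function names are invented for the sketch.

```python
from collections import defaultdict

# Toy corpus from the slides: forward index, doc -> {word: weight}.
docs = {
    "W": {"a": 1, "b": 2, "e": 5},
    "X": {"a": 3, "c": 4, "d": 5},
    "Y": {"b": 6, "c": 7, "d": 8},
    "Z": {"a": 9, "e": 10},
}

def build_reverse_index(docs):
    # word -> list of (doc, weight) postings.
    rindex = defaultdict(list)
    for doc, weights in docs.items():
        for word, w in weights.items():
            rindex[word].append((doc, w))
    return rindex

def map_postings(postings):
    # MAP: one posting list in, one partial product per document pair out.
    postings = sorted(postings)  # canonical pair order (di, dj)
    for i in range(len(postings)):
        for j in range(i + 1, len(postings)):
            (di, wi), (dj, wj) = postings[i], postings[j]
            yield (di, dj), wi * wj

# SHUFFLE + REDUCE: group partial products by pair and sum them.
sims = defaultdict(int)
for postings in build_reverse_index(docs).values():
    for pair, product in map_postings(postings):
        sims[pair] += product

assert sims[("W", "Z")] == 59  # 1*9 + 5*10
assert sims[("X", "Y")] == 68  # 4*7 + 5*8
```

Pairs that share no word, like (Y, Z), are simply never emitted, which is exactly why the reverse index avoids the O(D²) pairwise loop.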
Agenda
MAPREDUCE Background
Problem 1 – Similarity between all pairs of documents!
Problem 2 – Parallelizing K-Means clustering
Problem 3 – Finding all Maximal Cliques in a Graph
K-Means Clustering

centers → assignments:
  Ī“n,k^(t+1) = ( k == argmin_{j=1..K} Ī”(xn, mj^(t)) )

assignments → centers:
  mk^(t+1) ← Ī£n=1..N Ī“n,k^(t) xn  /  Ī£n=1..N Ī“n,k^(t)

[Figure: points with Ī“n,1^(t) = 1 and Ī“n,2^(t) = 1 assigned to centers m1^(t), m2^(t), which are then updated to m1^(t+1), m2^(t+1)]
K-means clustering 101 – Non-parallel

E-Step – Update assignments from centers:   O(NKD)
  Ļ€n^(t) ← argmin_{k=1..K} Ī”(xn, mk^(t))

M-Step – Update centers from cluster assignments:   O(ND)
  mk^(t+1) ← Ī£n=1..N Ī“(Ļ€n^(t) = k) xn  /  Ī£n=1..N Ī“(Ļ€n^(t) = k)

  N = number of data points, K = number of clusters, D = number of dimensions
K-Means MapReduce
Iterative MapReduce: update the cluster centers once per iteration.

Map (given current centers {mk^(t)}, k = 1..K):
  Ļ€n^(t) = argmin_{k=1..K} Ī”(xn, mk^(t))
  emit Key = Ļ€n^(t) → Value = xn

Shuffle: group all points by assigned cluster Ļ€n^(t)

Reduce (one per cluster k):
  mk^(t+1) ← Ī£n Ī“(Ļ€n^(t) = k) xn  /  Ī£n Ī“(Ļ€n^(t) = k)
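One iteration of this loop can be sketched in plain Python. This is a toy 1-D version under assumptions made for brevity: squared distance as Ī”, and the shuffle modeled as an in-memory grouping.

```python
def kmeans_iteration(points, centers):
    # MAP: assign each point to its nearest center; key = cluster index.
    def nearest(x):
        return min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
    # SHUFFLE: group points by assigned cluster.
    groups = {}
    for x in points:
        groups.setdefault(nearest(x), []).append(x)
    # REDUCE: the new center is the mean of the points assigned to it.
    return {k: sum(xs) / len(xs) for k, xs in groups.items()}

points = [1.0, 2.0, 9.0, 10.0]
print(kmeans_iteration(points, centers=[0.0, 8.0]))  # {0: 1.5, 1: 9.5}
```

Because both Map and Reduce depend on the previous iteration's centers, the whole job is rerun once per iteration, which is what "iterative MapReduce" means here.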
Agenda
MAPREDUCE Background
Problem 1 – Similarity between all pairs of documents!
Problem 2 – Parallelizing K-Means clustering
Problem 3 – Finding all Maximal Cliques in a Graph
Cliques: Useful structures in Graphs

  Nodes           Edges
  • People        • Co-Social
  • Products      • Co-purchase
  • Movies        • Co-like
  • Keywords      • Co-occurrence
  • Documents     • Similarity
  • Genes         • Co-expressions
  • Neurons       • Co-firing
Example Concepts in IMDB (keyword cliques):
• {guitarist, rock-music, guitar, song, musician, rock-band, singer, electric-guitar, singing}
• {university, school, college, student, classroom, school-teacher, teacher, teacher-student-relationship}
• {judge, lawsuit, trial, lawyer, false-persecution, perjury, courtroom}
Graph, Cliques, and Maximal Cliques
Clique = a ā€œfully connectedā€ sub-graph
Maximal Clique = a clique with no ā€œsuper-cliqueā€
Finding all Maximal Cliques is NP-hard: O(3^(n/3))
[Figure: example graph on nodes {a, b, c, d, e, f, g, h}]
Neighborhood of a Clique
[Same example graph]
f is connected to BOTH b and c
g is connected to BOTH b and c
⇒ N({b,c}) = {f, g}

CLIQUEMAP: Clique (key) → Its Neighborhood (value)
  {a} → {b,e}
  {a,b} → {e}
  {b,c} → {f,g}
  {b,c,f} → {g}
  {h} → āˆ…
  {c,d} → āˆ…
  {a,b,e} → āˆ…
  {b,c,f,g} → āˆ…
Growing Cliques from the CliqueMap
  {b,c,f} → {g}
  {b,c,f} is a clique, and g is connected to all of b, c, f
  ⇒ {b,c,f,g} is a clique
MapReduce for Maximal Cliques
CliqueMap of size k → size k + 1

Iteration 1 – Input: Adjacency List
  {a} → {b,e}
  {b} → {a,c,e,f,g}
  {c} → {b,d,f,g}
  {d} → {c}
  {e} → {a,b}
  {f} → {b,c,g}
  {g} → {b,c,f}
  {h} → āˆ…

Iteration 2
  {a,b} → {e}
  {a,e} → {b}
  {b,c} → {f,g}
  {b,e} → {a}
  {b,f} → {c,g}
  {b,g} → {c,f}
  {c,f} → {b,g}
  {c,g} → {b,f}
  {f,g} → {b,c}
  {c,d} → āˆ…

Iteration 3
  {a,b,e} → āˆ…
  {b,c,f} → {g}
  {b,c,g} → {f}
  {b,f,g} → {c}
  {c,f,g} → {b}

Iteration 4
  {b,c,f,g} → āˆ…
Key/Value for the MAP-Step
MAP: for each clique C and each vertex v in N(C), emit key C ∪ {v} with value N(C) \ {v}:

  {a} → {b,e}         emits  {a,b} ⇒ {e},  {a,e} ⇒ {b}
  {e} → {a,b}         emits  {a,e} ⇒ {b},  {b,e} ⇒ {a}
  {b} → {a,c,e,f,g}   emits  {a,b} ⇒ {c,e,f,g},  {b,c} ⇒ {a,e,f,g},
                             {b,e} ⇒ {a,c,f,g},  {b,f} ⇒ {a,c,e,g},
                             {b,g} ⇒ {a,c,e,f}

SHUFFLE (group values by key):
  {a,e} ⇒ {b}, {b}
  {a,b} ⇒ {e}, {c,e,f,g}
  {b,e} ⇒ {a,c,f,g}, {a}
Value combining in REDUCE-Step
Reduce = Intersection

  {a,e} ⇒ {b}, {b}        →  {a,e} → {b} ∩ {b} = {b}
  {a,b} ⇒ {e}, {c,e,f,g}  →  {a,b} → {e} ∩ {c,e,f,g} = {e}
  {b,e} ⇒ {a,c,f,g}, {a}  →  {b,e} → {a,c,f,g} ∩ {a} = {a}
Value combining in REDUCE-Step (continued)

  {c} → {b,d,f,g} and {d} → {c} emit:
    {c,d} ⇒ {b,f,g}
    {c,d} ⇒ āˆ…
  Reduce:  {c,d} → {b,f,g} ∩ āˆ… = āˆ…

  {b} → {a,c,e,f,g} and {c} → {b,d,f,g} emit:
    {b,c} ⇒ {a,e,f,g}
    {b,c} ⇒ {d,f,g}
  Reduce:  {b,c} → {a,e,f,g} ∩ {d,f,g} = {f,g}
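One grow-step can be sketched compactly in Python. This is an illustrative reading of the slides' scheme, not their exact code: the `len(values) == len(key)` filter is one way to encode that every size-k sub-clique of a genuine size-(k+1) clique must emit it, and maximal cliques are the keys whose reduced neighborhood is empty.

```python
from collections import defaultdict

def grow_cliques(clique_map):
    # One MapReduce round: CliqueMap of size k -> size k + 1.
    emitted = defaultdict(list)
    # MAP: for each (clique C -> neighborhood N(C)) and each v in N(C),
    # emit key C ∪ {v} with value N(C) \ {v}.
    for clique, nbrs in clique_map.items():
        for v in nbrs:
            emitted[clique | {v}].append(nbrs - {v})
    # REDUCE: a size-(k+1) key is a clique only if all k+1 of its
    # sub-cliques emitted it; its neighborhood is the intersection
    # of the values. An empty neighborhood means the clique is maximal.
    out = {}
    for key, values in emitted.items():
        if len(values) == len(key):
            nbrs = values[0]
            for v in values[1:]:
                nbrs = nbrs & v
            out[key] = nbrs
    return out

# Iteration 1 input: adjacency list of the slides' example graph.
adj = {
    frozenset("a"): frozenset("be"),
    frozenset("b"): frozenset("acefg"),
    frozenset("c"): frozenset("bdfg"),
    frozenset("d"): frozenset("c"),
    frozenset("e"): frozenset("ab"),
    frozenset("f"): frozenset("bcg"),
    frozenset("g"): frozenset("bcf"),
    frozenset("h"): frozenset(),
}
pairs = grow_cliques(adj)
assert pairs[frozenset("bc")] == frozenset("fg")
triples = grow_cliques(pairs)
assert triples[frozenset("abe")] == frozenset()   # {a,b,e} is maximal
assert triples[frozenset("bcf")] == frozenset("g")
```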
ā€œArt of Thinking Parallelā€ is about
ā–Ŗ Transforming the input data appropriately
ā–Ŗ e.g. Reverse Indexing (doc-doc similarity)
ā–Ŗ Breaking the problem into smaller ones
ā–Ŗ e.g. Iterative MapReduce (clustering)
ā–Ŗ Designing the Map step – key/value output
ā–Ŗ e.g. CliqueMaps in Maximal Cliques
ā–Ŗ Designing the Reduce step – combining the values of a key
ā–Ŗ e.g. Intersections in Maximal Cliques
