10. PageRank
The PageRank vector q is defined as

  q = c A^T q + ((1 - c) / N) e

The first term corresponds to browsing (following links), the second to teleporting (random jumps), where
  A is the source-by-destination adjacency matrix (each row normalized by its out-degree when used in the update); for the 4-node example graph:

      ( 0 1 1 1 )
  A = ( 0 0 1 1 )
      ( 0 0 0 1 )
      ( 0 0 1 0 )

  e is the all-ones vector,
  N is the number of nodes,
  c is a weight between 0 and 1 (e.g., 0.85).
PageRank indicates the importance of a page.
Algorithm: iterative powering (the power method) for finding the first eigenvector.
[Figure: the 4-node example graph with nodes 1-4.]
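A minimal power-iteration sketch of this update in Python, run on the 4-node example graph above with c = 0.85. Dividing each page's rank by its out-degree stands in for the row normalization of A that the slide leaves implicit; the iteration count of 50 is an arbitrary choice.

N = 4
c = 0.85
# out-links per node (0-indexed): 1 -> 2,3,4; 2 -> 3,4; 3 -> 4; 4 -> 3
links = {0: [1, 2, 3], 1: [2, 3], 2: [3], 3: [2]}

q = [1.0 / N] * N                      # start from the uniform vector
for _ in range(50):                    # iterate until (approximately) fixed
    nxt = [(1 - c) / N] * N            # teleporting term
    for x, outs in links.items():
        share = c * q[x] / len(outs)   # browsing term: q = c A^T q + ...
        for y in outs:
            nxt[y] += share
    q = nxt

print(q)                               # importance scores per page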
11. MapReduce: PageRank
PageRank Map()
  Input: key = page x, value = (PageRank q_x, links [y_1 ... y_m])
  Output: key = page x, value = partial_x
  1. Emit(x, 0)  // guarantee all pages will be emitted
  2. For each outgoing link y_i:
       Emit(y_i, q_x / m)

PageRank Reduce()
  Input: key = page x, value = the list of [partial_x]
  Output: key = page x, value = PageRank q_x
  1. q_x = 0
  2. For each partial value d in the list:
       q_x += d
  3. q_x = c * q_x + (1 - c) / N
  4. Emit(x, q_x)

[Figure: Map distributes each page's PageRank q_i over its out-links; Reduce collects the partials for each page and updates its PageRank.]
Check out Kang et al. ICDM'09.
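A runnable sketch of one such iteration, simulating Map(), the shuffle, and Reduce() with plain Python dictionaries instead of a real Hadoop job; the toy graph and the constants c and N are carried over from the previous slide's example.

from collections import defaultdict

c, N = 0.85, 4
graph = {0: [1, 2, 3], 1: [2, 3], 2: [3], 3: [2]}   # page -> out-links
q = {x: 1.0 / N for x in graph}                     # current PageRank

def pagerank_map(x, value):
    qx, links = value
    yield x, 0.0                       # guarantee every page is emitted
    for y in links:                    # distribute q_x over the m out-links
        yield y, qx / len(links)

def pagerank_reduce(x, partials):
    return x, c * sum(partials) + (1 - c) / N

grouped = defaultdict(list)            # the shuffle: group map output by key
for x in graph:
    for key, val in pagerank_map(x, (q[x], graph[x])):
        grouped[key].append(val)

q = dict(pagerank_reduce(x, vals) for x, vals in grouped.items())
print(q)                               # PageRank after one iteration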
13. Canopy: single-pass clustering
Canopy creation:
  Construct overlapping clusters (canopies).
  Make sure that no two canopies overlap too much.
  Key: no two canopy centers are too close to each other, controlled by two distance thresholds T1 > T2.
[Figure: four overlapping canopies C1-C4; two concentric circles illustrate the thresholds T1 > T2.]
McCallum, Nigam and Ungar, "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching", KDD'00
14. Canopy creation
Input: 1) points; 2) thresholds T1, T2, where T1 > T2
Output: cluster centroids

Put all points into a queue Q
While Q is not empty:
  - p = dequeue(Q)
  - strongBound = false
  - For each canopy c:
      if dist(p, c) < T1: c.add(p)
      if dist(p, c) < T2: strongBound = true
  - If not strongBound: create a new canopy at p
For each canopy c:
  - Set the centroid to the mean of all points in c

[Figure: canopies C1 and C2 with thresholds T1 and T2; points within T2 of a canopy center are strongly marked, the other points within T1 are ordinary members of the canopy.]
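A minimal single-machine sketch of this pseudocode in Python, assuming Euclidean distance over 2-D points; the thresholds and toy points are invented for illustration.

import math

def canopy_centroids(points, T1, T2):
    """Single-pass canopy creation; returns one centroid per canopy."""
    assert T1 > T2
    canopies = []                          # each entry: (center, member list)
    for p in points:                       # dequeue points one by one
        strong_bound = False
        for center, members in canopies:
            d = math.dist(p, center)
            if d < T1:
                members.append(p)          # p joins this (overlapping) canopy
            if d < T2:
                strong_bound = True        # p is too close to an existing center
        if not strong_bound:
            canopies.append((p, [p]))      # p becomes a new canopy center
    # centroid = mean of all points assigned to the canopy
    return [tuple(sum(xs) / len(members) for xs in zip(*members))
            for _, members in canopies]

points = [(0, 0), (1, 0), (0.5, 4), (0.4, 4.2), (8, 8)]
print(canopy_centroids(points, T1=3.0, T2=1.5))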
19. MapReduce - Canopy Map()
Canopy creation Map()
  Input: a set of points P; thresholds T1, T2
  Output: key = null; value = a list of local canopies (total, count)

  For each p in P:
    strongBound = false
    For each canopy c:
      if dist(p, c) < T1 then c.total += p; c.count++
      if dist(p, c) < T2 then strongBound = true
    If not strongBound then create a new canopy at p
  Close():
    For each canopy c: Emit(null, (c.total, c.count))

[Figure: Map1 and Map2 each build local canopies over their own input partition.]
20. MapReduce - Canopy Reduce()
Reduce()
  Input: key = null; values = the list of (total, count) pairs
  Output: key = null; value = cluster centroids

  For each intermediate value (total, count):
    p = total / count
    strongBound = false
    For each canopy c:
      if dist(p, c) < T1 then c.total += p; c.count++
      if dist(p, c) < T2 then strongBound = true
    If not strongBound then create a new canopy at p
  Close():
    For each canopy c: Emit(null, c.total / c.count)

For simplicity we assume only one reducer.
[Figure: the single reducer merges Map1's and Map2's local canopies into the final centroids.]
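A sketch of the full two-phase job, simulated in Python: each mapper builds local canopies over its own split and emits (total, count) pairs under a single key, and the lone reducer (as the slide assumes) turns each pair back into a point and canopies those points once more. The helper name, splits, and thresholds are all assumptions.

import math

def local_canopies(pts, T1, T2):
    """Used by both phases: returns a list of (total_vector, count) pairs."""
    canopies = []                              # each entry: [center, total, count]
    for p in pts:
        strong_bound = False
        for c in canopies:
            d = math.dist(p, c[0])
            if d < T1:
                c[1] = [t + x for t, x in zip(c[1], p)]
                c[2] += 1
            if d < T2:
                strong_bound = True
        if not strong_bound:
            canopies.append([p, list(p), 1])
    return [(total, count) for _, total, count in canopies]

T1, T2 = 3.0, 1.5
splits = [[(0, 0), (1, 0), (8, 8)], [(0.5, 4), (0.4, 4.2), (8.5, 8)]]

# Map phase: each mapper canopies its split; all output shares the key null
map_out = []
for split in splits:
    map_out.extend(local_canopies(split, T1, T2))

# Reduce phase (a single reducer): treat each partial centroid total/count
# as a point and canopy those points once more, then emit final centroids
partial_centers = [tuple(t / n for t in total) for total, n in map_out]
final = [tuple(t / n for t in total)
         for total, n in local_canopies(partial_centers, T1, T2)]
print(final)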
22. Clustering Assignment
Cluster assignment:
  For each point p:
    Assign p to the closest canopy center.
23. MapReduce: Cluster Assignment
Cluster assignment Map()
  Input: point p; the cluster centroids
  Output: key = cluster id; value = point id

  currentDist = inf
  For each cluster centroid c:
    If dist(p, c) < currentDist then bestCluster = c; currentDist = dist(p, c)
  Emit(bestCluster, p)

Results can be written directly back to HDFS without a reducer, or an identity reducer can be applied to get the output sorted on cluster id.
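This Map() as a plain Python function, with Euclidean distance assumed; the sample point and centroids below are made up (the centroids echo the earlier canopy example).

import math

def assign_map(point_id, p, centroids):
    """Emit (best cluster id, point id) for one input point."""
    best, best_dist = None, float("inf")
    for cid, c in enumerate(centroids):
        d = math.dist(p, c)
        if d < best_dist:
            best, best_dist = cid, d
    return best, point_id            # no reducer needed; write straight out

print(assign_map("p7", (0.6, 3.9), [(0.5, 0.0), (0.45, 4.1), (8.0, 8.0)]))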
24. KMeans: multi-pass clustering
Traditional KMeans:

  KMeans():
    While not converged:
      AssignCluster()
      UpdateCentroids()

  AssignCluster():
    For each point p:
      Assign p to the closest centroid c

  UpdateCentroids():
    For each cluster:
      Update the cluster center
25. MapReduce - KMeans
KmeansIter()
  Map(p)  // assign cluster: send each point p to its closest centroid
    For c in clusters:
      If dist(p, c) < minDist then minC = c; minDist = dist(p, c)
    Emit(minC.id, (p, 1))
  Reduce()  // update centroids: each centroid's new location is total/count
    For all values (p, c):
      total += p; count += c
    Emit(key, (total, count))

[Figure: starting from the initial centroids, Map1 and Map2 assign each point to its closest centroid; Reduce1 and Reduce2 update each centroid with its new location (total, count).]
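One KmeansIter() as a simulated MapReduce round in Python, matching the pseudocode above; the driver loop, the 2-D toy points, and the initial centroids are assumptions for illustration.

import math
from collections import defaultdict

def kmeans_iter(points, centroids):
    """One KmeansIter(): Map assigns points, Reduce recomputes centroids."""
    sums = defaultdict(lambda: ((0.0, 0.0), 0))     # cid -> (total, count); 2-D
    for p in points:
        # Map(p): emit (minC.id, (p, 1)) for the closest centroid
        cid = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        total, count = sums[cid]
        sums[cid] = (tuple(t + x for t, x in zip(total, p)), count + 1)
    # Reduce(): new centroid = total / count (empty clusters keep their center)
    new = []
    for cid in range(len(centroids)):
        total, count = sums[cid]
        new.append(tuple(t / count for t in total) if count else centroids[cid])
    return new

points = [(0, 0), (1, 0), (0.5, 4), (0.4, 4.2), (8, 8), (8.5, 8)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
for _ in range(5):                                  # driver re-runs the job
    centroids = kmeans_iter(points, centroids)
print(centroids)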
28. MapReduce kNN
Map()
  Input: all points in this mapper's split; the query point p
  Output: the k nearest neighbors of p (local)
  Emit the k closest points to p

Reduce()
  Input: key = null; values = the local neighbors; the query point p
  Output: the k nearest neighbors of p (global)
  Emit the k closest points to p among all local neighbors

[Figure: k = 3 example over points labeled 0 and 1; Map1 and Map2 each find local neighbors of the query point, and a single Reduce merges them.]
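A runnable Python sketch in which heapq.nsmallest stands in for "emit the k closest points"; the splits and the query point are made up.

import heapq, math

def knn_map(points, query, k):
    """Each mapper emits its k locally closest points."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

def knn_reduce(local_lists, query, k):
    """The single reducer merges local candidates into the global top k."""
    merged = [p for lst in local_lists for p in lst]
    return heapq.nsmallest(k, merged, key=lambda p: math.dist(p, query))

query, k = (1.0, 1.0), 3
splits = [[(0, 0), (5, 5), (1, 2)], [(2, 1), (9, 9), (1, 0)]]
print(knn_reduce([knn_map(s, query, k) for s in splits], query, k))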
29. Naïve Bayes
Formulation: P(c|d) ∝ P(c) ∏_{w∈d} P(w|c)
  where c is a class label, d is a doc, w is a word.

Parameter estimation:
  Class prior: P̂(c) = N_c / N, where N_c is the number of docs in class c and N is the total number of docs.
  Conditional probability: P̂(w|c) = T_cw / Σ_{w'} T_cw', where T_cw is the number of occurrences of w in class c.

Goals:
  1. total number of docs N
  2. number of docs in class c: N_c
  3. word-count histogram in class c: T_cw
  4. total word count in class c: Σ_{w'} T_cw'

[Figure: docs d1-d3 form class c1 (N1 = 3) with term-count vector T_1w; docs d4-d7 form class c2 (N2 = 4) with term-count vector T_2w.]
30. MapReduce: Naïve Bayes
Goals: 1. total number of docs N; 2. number of docs in c: N_c; 3. word-count histogram in c: T_cw; 4. total word count in c: Σ_{w'} T_cw'.

ClassPrior()
  Map(doc): Emit(class_id, (doc_id, doc_length))
  Combine()/Reduce():
    Nc = 0; sTcw = 0
    For each doc_id: Nc++; sTcw += doc_length
    Emit(c, (Nc, sTcw))   // yields goals 2 and 4

ConditionalProbability()
  Map(doc):
    For each word w in doc: Emit(pair(c, w), 1)
  Combine()/Reduce():
    Tcw = 0
    For each value v: Tcw += v
    Emit(pair(c, w), Tcw)   // yields goal 3

Naïve Bayes can thus be implemented using MapReduce jobs of histogram computation.
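A toy simulation of both jobs in Python, assuming a tiny labeled corpus; dictionaries stand in for the shuffle and Counter plays the Combine()/Reduce() role.

from collections import Counter, defaultdict

docs = [("sports", "ball game ball"), ("sports", "game win"),
        ("tech", "code bug code")]                  # toy (class, text) corpus

# ClassPrior(): Map emits (class, doc_length); Reduce counts docs and words
groups = defaultdict(list)
for c, text in docs:
    groups[c].append(len(text.split()))             # Emit(c, doc_length)
N = len(docs)                                       # goal 1
Nc = {c: len(v) for c, v in groups.items()}         # goal 2
total_words = {c: sum(v) for c, v in groups.items()}   # goal 4
prior = {c: Nc[c] / N for c in Nc}                  # P(c) = Nc / N

# ConditionalProbability(): Map emits ((c, w), 1); Reduce sums into T_cw
Tcw = Counter()                                     # goal 3
for c, text in docs:
    for w in text.split():
        Tcw[(c, w)] += 1                            # Emit(pair(c, w), 1)
cond = {cw: n / total_words[cw[0]] for cw, n in Tcw.items()}  # P(w|c)

print(prior, cond[("sports", "ball")])   # P(sports) = 2/3, P(ball|sports) = 2/5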
32. Taxonomy of MapReduce algorithms

                         One Iteration       Multiple Iterations     Not good for MapReduce
  Clustering             Canopy              KMeans
  Classification         Naïve Bayes, kNN    Gaussian Mixture        SVM
  Graphs                                     PageRank
  Information Retrieval  Inverted Index      Topic modeling
                                             (PLSI, LDA)

One-iteration algorithms are a perfect fit.
Multiple-iteration algorithms are an OK fit, but small shared state has to be synchronized across iterations (typically through the filesystem).
Some algorithms are not a good fit for the MapReduce framework: they typically require large shared state with a lot of synchronization. A traditional parallel framework like MPI is better suited for those.
33. MapReduce for machine learning algorithms
The key is to convert the computation into summation form (the Statistical Query Model [Kearns'94]):
  y = Σ_i f(x_i), where f(x) corresponds to Map() and the summation Σ corresponds to Reduce().

Naïve Bayes:
  MR job: P(c)
  MR job: P(w|c)
KMeans:
  MR job: split the data into subgroups and compute partial sums in Map(), then sum them up in Reduce()

Map-Reduce for Machine Learning on Multicore [NIPS'06]
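The summation-form pattern itself, sketched in Python: Map() computes f over each record of a split and emits partial sums, Reduce() adds them. The statistic f and the splits are made-up examples (here the mean, obtained from two summation-form statistics).

def summation_mr(splits, f):
    """y = sum_i f(x_i): each Map() sums f over its split, Reduce() adds."""
    partials = [sum(f(x) for x in split) for split in splits]   # Map phase
    return sum(partials)                                        # Reduce phase

splits = [[1.0, 2.0, 3.0], [4.0, 5.0]]
total = summation_mr(splits, lambda x: x)       # statistic 1: the sum
count = summation_mr(splits, lambda x: 1)       # statistic 2: the count
print(total / count)                            # the mean, 3.0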
34. Machine learning algorithms using MapReduce
Linear regression: θ̂ = (XᵀX)⁻¹ Xᵀy, where X ∈ R^{m×n} and y ∈ R^m
  MR job 1: A = XᵀX = Σ_i x_i x_iᵀ
  MR job 2: b = Xᵀy = Σ_i x_i y⁽ⁱ⁾
  Finally, solve A θ̂ = b.

Locally weighted linear regression:
  MR job 1: A = Σ_i w_i x_i x_iᵀ
  MR job 2: b = Σ_i w_i x_i y⁽ⁱ⁾

Map-Reduce for Machine Learning on Multicore [NIPS'06]
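Both MR jobs for the normal equations, sketched in Python without Hadoop or NumPy so the example stays self-contained; the 2-D toy data and the closed-form 2x2 solve are assumptions for illustration.

def outer(x):                       # x x^T for a length-n vector
    return [[xi * xj for xj in x] for xi in x]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# toy data: rows x_i = (1, feature), so theta = (intercept, slope)
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
y = [1.0, 3.0, 5.0]

# MR job 1: A = sum_i x_i x_i^T (each mapper sums its split, reducer adds)
A = [[0.0, 0.0], [0.0, 0.0]]
for x in X:
    A = mat_add(A, outer(x))

# MR job 2: b = X^T y = sum_i x_i * y_i
b = [sum(x[j] * yi for x, yi in zip(X, y)) for j in range(2)]

# solve A theta = b (2x2 closed form; a real job would use any solver)
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
theta = [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
         (A[0][0] * b[1] - A[1][0] * b[0]) / det]
print(theta)                        # ~[1.0, 2.0], i.e. y = 1 + 2x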
35. Machine learning algorithms using MapReduce (cont.)
  Logistic regression
  Neural networks
  PCA
  ICA
  EM for Gaussian Mixture Models

Map-Reduce for Machine Learning on Multicore [NIPS'06]
37. Mahout: Hadoop data mining library
Mahout: http://lucene.apache.org/mahout/
  Scalable data mining libraries, mostly implemented in Hadoop.
Data structures for vectors and matrices:
  Vectors
    Dense vectors, backed by a double[]
    Sparse vectors, backed by a HashMap<Integer, Double>
    Operations: assign, cardinality, copy, divide, dot, get, haveSharedCells, like, minus, normalize, plus, set, size, times, toArray, viewPart, zSum and cross
  Matrices
    Dense matrix, backed by a double[][]
    SparseRowMatrix or SparseColumnMatrix, backed by a Vector[] holding the rows or columns of the matrix as SparseVectors
    SparseMatrix, backed by a HashMap<Integer, Vector>
    Operations: assign, assignColumn, assignRow, cardinality, copy, divide, get, haveSharedCells, like, minus, plus, set, size, times, transpose, toArray, viewPart and zSum
38. MapReduce Mining Papers
[Chu et al. NIPS'06] Map-Reduce for Machine Learning on Multicore
  A general framework under MapReduce.
[Papadimitriou et al. ICDM'08] DisCo: Distributed Co-clustering with Map-Reduce
  Co-clustering.
[Kang et al. ICDM'09] PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations
  Graph algorithms.
[Das et al. WWW'07] Google News Personalization: Scalable Online Collaborative Filtering
  PLSI with EM.
[Grossman and Gu KDD'08] Data Mining Using High Performance Data Clouds: Experimental Studies Using Sector and Sphere
  An alternative to Hadoop that supports wide-area data collection and distribution.
39. Summary: algorithms
Best for MapReduce:
  Single pass; keys are uniformly distributed.
OK for MapReduce:
  Multiple passes; intermediate state is small.
Bad for MapReduce:
  Key distribution is skewed.
  Fine-grained synchronization is required.
40. Large-scale Data Mining:
MapReduce and Beyond
Part 2: Algorithms
Spiros Papadimitriou, IBM Research
Jimeng Sun, IBM Research
Rong Yan, Facebook