
1. Large-scale Data Mining: MapReduce and Beyond
    Part 2: Algorithms
    Spiros Papadimitriou, IBM Research
    Jimeng Sun, IBM Research
    Rong Yan, Facebook
2. Part 2: Mining using MapReduce
    • Mining algorithms using MapReduce
      – Information retrieval
      – Graph algorithms: PageRank
      – Clustering: Canopy clustering, KMeans
      – Classification: kNN, Naïve Bayes
    • MapReduce Mining Summary
3. MapReduce Interface and Data Flow
    • Map: (K1, V1) → list(K2, V2)
    • Combine: (K2, list(V2)) → list(K2, V2)
    • Partition: (K2, V2) → reducer_id
    • Reduce: (K2, list(V2)) → list(K3, V3)
    [Figure: data flow across hosts for a word-to-document job. Mappers on Host 1 and Host 2 turn (id, doc) into list(w, id); combiners condense this to list(unique_w, id); the partition step routes each word to a reducer; reducers on Host 3 and Host 4 output w1, list(unique_id) and w2, list(unique_id).]
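To make the four interfaces concrete, the following is a minimal in-memory simulation of the word-to-document flow sketched in the figure. It is only an illustration of the signatures above, not Hadoop's actual API; the run() driver, the sample documents, and the function names are assumptions of this sketch.

```python
from collections import defaultdict

def map_fn(doc_id, doc):                      # Map: (K1, V1) -> list(K2, V2)
    return [(word, doc_id) for word in doc.split()]

def combine_fn(word, doc_ids):                # Combine: (K2, list(V2)) -> list(K2, V2)
    return [(word, d) for d in sorted(set(doc_ids))]   # drop local duplicates

def partition_fn(word, num_reducers):         # Partition: (K2, V2) -> reducer_id
    return hash(word) % num_reducers

def reduce_fn(word, doc_ids):                 # Reduce: (K2, list(V2)) -> list(K3, V3)
    return [(word, sorted(set(doc_ids)))]

def run(docs, num_reducers=2):
    per_reducer = defaultdict(lambda: defaultdict(list))
    for doc_id, doc in docs.items():          # map + local combine on each "host"
        local = defaultdict(list)
        for k, v in map_fn(doc_id, doc):
            local[k].append(v)
        for k, vs in local.items():
            for k2, v2 in combine_fn(k, vs):
                per_reducer[partition_fn(k2, num_reducers)][k2].append(v2)
    out = []                                  # the shuffle is implicit in per_reducer
    for groups in per_reducer.values():
        for k, vs in groups.items():
            out.extend(reduce_fn(k, vs))
    return sorted(out)

print(run({1: "data mining", 5: "mining graphs"}))
# [('data', [1]), ('graphs', [5]), ('mining', [1, 5])]
```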
4. Information retrieval using MapReduce
5. IR: Distributed Grep
    • Find the doc_id and line# of a matching pattern
    • Map: (id, doc) → list(id, line#)
    • Reduce: none
    [Figure: grep for "data mining" over docs 1-6; Map1 outputs <1, 123>, Map2 outputs <3, 717> and <5, 1231>, Map3 outputs <6, 1012>.]
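A possible reading of the grep mapper as runnable code is sketched below; the streaming-style stdin driver and the tab-separated (doc_id, text) input format are assumptions for illustration (an actual Hadoop job would use the Java Mapper API), and there is no reduce step.

```python
import re
import sys

PATTERN = re.compile(r"data mining")   # the pattern from the slide's example

def grep_map(doc_id, doc_text):
    """(id, doc) -> list of (id, line#) for lines matching the pattern."""
    return [(doc_id, line_no)
            for line_no, line in enumerate(doc_text.splitlines(), start=1)
            if PATTERN.search(line)]

if __name__ == "__main__":
    # Toy driver: each stdin line carries one document as "<doc_id>\t<doc text>"
    for raw in sys.stdin:
        doc_id, _, text = raw.rstrip("\n").partition("\t")
        for did, line_no in grep_map(doc_id, text):
            print(f"{did}\t{line_no}")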
6. IR: URL Access Frequency
    • Map: (null, log) → (URL, 1)
    • Reduce: (URL, list(1)) → (URL, total_count)
    [Figure: Map1 emits <u1,1>, <u1,1>, <u2,1>; Map2 emits <u3,1>; Map3 emits <u3,1>; the reducer outputs <u1,2>, <u2,1>, <u3,2>.]
    • Also described in Part 1
7. IR: Reverse Web-Link Graph
    • Map: (null, page) → (target, source)
    • Reduce: (target, list(source)) → (target, list(source))
    [Figure: Map1 emits <t1,s2>; Map2 emits <t2,s3> and <t2,s5>; Map3 emits <t3,s5>; the reducer outputs <t1,[s2]>, <t2,[s3,s5]>, <t3,[s5]>.]
    • This is the same as transposing the adjacency matrix.
8. IR: Inverted Index
    • Map: (id, doc) → list(word, id)
    • Reduce: (word, list(id)) → (word, list(id))
    [Figure: Map1 emits <w1,1>; Map2 emits <w2,2> and <w3,3>; Map3 emits <w1,5>; the reducer outputs <w1,[1,5]>, <w2,[2]>, <w3,[3]>.]
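The inverted-index job can be simulated in a few lines; the sample documents and the in-memory shuffle are illustrative assumptions, but the map and reduce functions follow the signatures on this slide.

```python
from collections import defaultdict

def map_invert(doc_id, doc):
    # (id, doc) -> list(word, id); emit each distinct word once per doc
    return [(w, doc_id) for w in set(doc.lower().split())]

def reduce_invert(word, doc_ids):
    # (word, list(id)) -> (word, sorted list(id))
    return (word, sorted(doc_ids))

docs = {1: "large scale data mining", 5: "data mining with MapReduce"}
shuffled = defaultdict(list)              # stands in for the shuffle phase
for doc_id, text in docs.items():
    for word, did in map_invert(doc_id, text):
        shuffled[word].append(did)
index = dict(reduce_invert(w, ids) for w, ids in shuffled.items())
print(index["data"])   # [1, 5]
```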
9. Graph mining using MapReduce
10. PageRank
    • The PageRank vector q is defined as q = c·Aᵀq + ((1−c)/N)·e, combining browsing (first term) and teleporting (second term), where
      – A is the source-by-destination adjacency matrix; for the 4-node example graph,
        A = [[0 1 1 1], [0 0 1 1], [0 0 0 1], [0 0 1 0]]
      – e is the all-one vector
      – N is the number of nodes
      – c is a weight between 0 and 1 (e.g. 0.85)
    • PageRank indicates the importance of a page.
    • Algorithm: iterative powering to find the first eigenvector.
11. MapReduce: PageRank
    PageRank Map()
    • Input: key = page x, value = (PageRank qx, out-links [y1…ym])
    • Output: key = page, value = partial PageRank contribution
      1. Emit(x, 0)   // guarantees every page is emitted, even one with no in-links
      2. For each outgoing link yi: Emit(yi, qx/m)
    PageRank Reduce()
    • Input: key = page x, value = the list of partial contributions [partialx]
    • Output: key = page x, value = new PageRank qx
      1. qx = 0
      2. For each partial value d in the list: qx += d
      3. qx = c·qx + (1−c)/N
      4. Emit(x, qx)
    [Figure: on the 4-node example graph, Map distributes each PageRank qi over its out-links and Reduce sums the contributions into the new PageRank of each page.]
    Check out Kang et al., ICDM'09 (PEGASUS).
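Below is a hedged, in-memory sketch of the Map()/Reduce() pseudocode above, iterated by a small driver loop; the 4-node example graph, c = 0.85, and the fixed number of iterations are illustrative assumptions (a real job would run one MapReduce pass per iteration and test for convergence).

```python
from collections import defaultdict

def pagerank_map(x, qx, out_links):
    yield (x, 0.0)                      # guarantee page x appears in the output
    for y in out_links:
        yield (y, qx / len(out_links))  # distribute qx over the out-links

def pagerank_reduce(x, partials, c, n):
    return (x, c * sum(partials) + (1.0 - c) / n)

links = {1: [2, 3, 4], 2: [3, 4], 3: [4], 4: [3]}   # example adjacency list
q = {x: 1.0 / len(links) for x in links}            # uniform initial PageRank
c, n = 0.85, len(links)

for _ in range(20):                                  # one "MR job" per iteration
    shuffled = defaultdict(list)
    for x, out in links.items():
        for key, val in pagerank_map(x, q[x], out):
            shuffled[key].append(val)
    q = dict(pagerank_reduce(x, parts, c, n) for x, parts in shuffled.items())

print({x: round(v, 3) for x, v in sorted(q.items())})
```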
12. Clustering using MapReduce
13. Canopy: single-pass clustering
    • Construct overlapping clusters (canopies)
    • Make sure no two canopies overlap too much
      – Key: no two canopy centers are too close to each other
    [Figure: canopies C1-C4 built with two thresholds T1 > T2; overlapping clusters are fine, but too much overlap is avoided.]
    McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
14. Canopy creation
    Input: 1) points, 2) thresholds T1, T2 where T1 > T2
    Output: cluster centroids
    • Put all points into a queue Q
    • While Q is not empty:
      – p = dequeue(Q)
      – For each canopy c:
        · if dist(p,c) < T1: c.add(p)
        · if dist(p,c) < T2: strongBound = true
      – If not strongBound: create a canopy at p
    • For each canopy c:
      – Set its centroid to the mean of all points in c
    McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
15.-18. Canopy creation (continued): slides 15-18 repeat the pseudocode of slide 14 while the accompanying figure is built up, marking the canopy centers, the two thresholds T1 > T2, the strongly bound points (within T2 of a center), and the other points of each cluster (within T1).
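As a concrete reading of the canopy-creation pseudocode on slide 14, here is a small runnable sketch; the Euclidean distance and the random 2-D points are assumptions made for illustration.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy_create(points, t1, t2):
    assert t1 > t2
    canopies = []                  # each canopy: {"center": point, "members": [...]}
    for p in points:               # the queue Q, consumed in order
        strong_bound = False
        for c in canopies:
            d = dist(p, c["center"])
            if d < t1:
                c["members"].append(p)
            if d < t2:
                strong_bound = True
        if not strong_bound:       # p is not strongly bound: start a new canopy
            canopies.append({"center": p, "members": [p]})
    # final centroids: mean of all points in each canopy
    return [tuple(sum(x) / len(c["members"]) for x in zip(*c["members"]))
            for c in canopies]

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(100)]
print(canopy_create(pts, t1=0.5, t2=0.2))
```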
19. MapReduce: Canopy Map()
    Canopy creation Map()
    • Input: a set of points P; thresholds T1, T2
    • Output: key = null; value = a list of local canopies (total, count)
    Map():
    • For each p in P:
      – For each canopy c:
        · if dist(p,c) < T1 then c.total += p; c.count++
        · if dist(p,c) < T2 then strongBound = true
      – If not strongBound then create a canopy at p
    Close():
    • For each canopy c: Emit(null, (total, count))
    [Figure: Map1 and Map2 each build local canopies over their own partition of the points.]
    McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
20. MapReduce: Canopy Reduce()
    Reduce()
    • Input: key = null; values = the (total, count) pairs from all mappers
    • Output: key = null; value = cluster centroids
    • For each intermediate value (total, count):
      – p = total / count
      – For each canopy c:
        · if dist(p,c) < T1 then c.total += p; c.count++
        · if dist(p,c) < T2 then strongBound = true
      – If not strongBound then create a canopy at p
    Close():
    • For each canopy c: Emit(null, c.total / c.count)
    [Figure: the reducer merges the local canopies produced by Map1 and Map2.]
    For simplicity we assume only one reducer.
    McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
21. MapReduce: Canopy Reduce() (continued): the same pseudocode as slide 20, now with the figure showing the reducer's merged result. Remark: this assumes only one reducer.
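The Map()/Reduce() pseudocode of slides 19-21 can be simulated as below: each mapper builds local canopies over its partition and emits (total, count) pairs in close(), and a single reducer re-runs the same canopy logic over the local centroids. The Euclidean distance, the random 2-D points, and the two-way split are assumptions of this sketch.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def vadd(a, b):
    return tuple(x + y for x, y in zip(a, b))

def local_canopies(points, t1, t2):
    """The shared canopy logic used by both Map() and Reduce()."""
    canopies = []                              # each: {"center", "total", "count"}
    for p in points:
        strong_bound = False
        for c in canopies:
            d = dist(p, c["center"])
            if d < t1:
                c["total"] = vadd(c["total"], p)
                c["count"] += 1
            if d < t2:
                strong_bound = True
        if not strong_bound:
            canopies.append({"center": p, "total": p, "count": 1})
    return [(c["total"], c["count"]) for c in canopies]   # emitted in close()

def canopy_reduce(values, t1, t2):
    """Single reducer: turn the mappers' (total, count) pairs into final centroids."""
    local_centroids = [tuple(x / cnt for x in total) for total, cnt in values]
    return [tuple(x / cnt for x in total)
            for total, cnt in local_canopies(local_centroids, t1, t2)]

random.seed(1)
pts = [(random.random(), random.random()) for _ in range(100)]
emitted = [kv for part in (pts[:50], pts[50:])            # two mappers
           for kv in local_canopies(part, t1=0.5, t2=0.2)]
print(canopy_reduce(emitted, t1=0.5, t2=0.2))
```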
22. Clustering assignment
    • For each point p:
      – Assign p to the closest canopy center
    McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
23. MapReduce: Cluster Assignment
    Cluster assignment Map()
    • Input: point p; the cluster centroids
    • Output: key = cluster id; value = point id
    • currentDist = inf
    • For each cluster centroid c:
      – If dist(p,c) < currentDist then bestCluster = c; currentDist = dist(p,c)
    • Emit(bestCluster, p)
    Results can be written directly back to HDFS without a reducer, or an identity reducer can be applied to get the output sorted by cluster id.
    McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
24. KMeans: multi-pass clustering
    Traditional KMeans():
    • While not converged:
      – AssignCluster()
      – UpdateCentroids()
    AssignCluster():
    • For each point p: assign p to the closest centroid c
    UpdateCentroids():
    • For each cluster: update the cluster center
25. MapReduce: KMeans
    KmeansIter()
    Map(p)  // assign cluster
    • For each c in clusters:
      – If dist(p,c) < minDist then minC = c; minDist = dist(p,c)
    • Emit(minC.id, (p, 1))
    Reduce()  // update centroids
    • For all values (p, c):
      – total += p; count += c
    • Emit(key, (total, count))
    [Figure: Map1 and Map2 assign their points to the initial centroids.]
26. MapReduce: KMeans (continued): the same pseudocode as slide 25; the figure shows Map assigning each point p to the closest of centroids 1-4 and Reduce updating each centroid with its new location (total, count).
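A hedged sketch of KmeansIter(), with a driver loop playing the role of the multi-pass scheduler; the Euclidean distance, the random 2-D points, and the way centroids are passed between rounds in memory are illustrative assumptions (a real job would persist the centroids, e.g. in HDFS or the distributed cache, between iterations).

```python
import math
import random
from collections import defaultdict

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_map(p, centroids):
    # Assign p to the closest centroid; emit (centroid_id, (p, 1))
    min_id = min(centroids, key=lambda cid: dist(p, centroids[cid]))
    return (min_id, (p, 1))

def kmeans_reduce(cid, values):
    # Sum the points and counts; the new centroid is total / count
    total = [0.0] * len(values[0][0])
    count = 0
    for p, c in values:
        total = [t + x for t, x in zip(total, p)]
        count += c
    return (cid, tuple(t / count for t in total))

random.seed(2)
pts = [(random.random(), random.random()) for _ in range(200)]
centroids = {i: pts[i] for i in range(3)}           # 3 initial centroids

for _ in range(10):                                  # driver: one MR job per iteration
    shuffled = defaultdict(list)
    for p in pts:
        cid, value = kmeans_map(p, centroids)
        shuffled[cid].append(value)
    centroids = dict(kmeans_reduce(cid, vals) for cid, vals in shuffled.items())

print(centroids)
```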
27. Classification using MapReduce
28. MapReduce: kNN (k = 3)
    Map()
    • Input: all points in the mapper's partition, and the query point p
    • Output: the k nearest neighbors within the partition (local)
    • Emit the k closest points to p
    Reduce()
    • Input: key = null; values = the local neighbors from all mappers, plus the query point p
    • Output: the k nearest neighbors overall (global)
    • Emit the k closest points to p among all local neighbors
    [Figure: Map1 and Map2 each pick 3 local neighbors of the query point among their 0/1-labeled points; the reducer keeps the 3 globally closest.]
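The kNN map/reduce above can be sketched for a single query point as follows; the Euclidean distance, k = 3, and the random labeled points are illustrative assumptions.

```python
import heapq
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_map(partition, query, k):
    # Emit the k closest (distance, point, label) triples from this partition
    return heapq.nsmallest(k, ((dist(p, query), p, label) for p, label in partition))

def knn_reduce(local_neighbors, k):
    # Global k nearest among all mappers' local candidates
    return heapq.nsmallest(k, local_neighbors)

random.seed(3)
data = [((random.random(), random.random()), random.choice([0, 1])) for _ in range(100)]
query, k = (0.5, 0.5), 3

local = [n for part in (data[:50], data[50:])       # two mappers
         for n in knn_map(part, query, k)]
neighbors = knn_reduce(local, k)
print(neighbors)
print("predicted label:", max({0, 1}, key=[l for _, _, l in neighbors].count))
```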
29. Naïve Bayes
    • Formulation: P(c|d) ∝ P(c) · ∏_{w∈d} P(w|c)
      – c is a class label, d is a doc, w is a word
    • Parameter estimation
      – Class prior: P̂(c) = Nc / N, where Nc is the number of docs in class c and N is the total number of docs
      – Conditional probability: P̂(w|c) = Tcw / Σ_{w'} Tcw', where Tcw is the number of occurrences of w in class c
    • Goals:
      1. total number of docs N
      2. number of docs in class c: Nc
      3. word-count histogram in class c: Tcw
      4. total word count in class c: Σ_{w'} Tcw'
    [Figure: docs d1-d3 belong to class c1 (N1 = 3) and docs d4-d7 to class c2 (N2 = 4); each doc is a term vector, giving per-class word-count histograms T1w and T2w.]
30. MapReduce: Naïve Bayes
    ClassPrior()
    • Map(doc): Emit(class_id, (doc_id, doc_length))
    • Combine()/Reduce():
      – Nc = 0; sTcw = 0   // sTcw accumulates the total word count of the class
      – For each doc_id: Nc++; sTcw += doc_length
      – Emit(c, Nc)
    ConditionalProbability()
    • Map(doc):
      – For each word w in doc: Emit(pair(c, w), 1)
    • Combine()/Reduce():
      – Tcw = 0
      – For each value v: Tcw += v
      – Emit(pair(c, w), Tcw)
    Goals: 1. total number of docs N; 2. number of docs in class c: Nc; 3. word-count histogram in class c: Tcw; 4. total word count in class c: Σ_{w'} Tcw'
    Naïve Bayes can be implemented using MapReduce jobs that compute histograms.
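The two histogram jobs can be simulated in memory as below, followed by the classification rule from slide 29. The toy labeled documents are assumptions, and the add-one smoothing in classify() is an addition of this sketch (the slides use the unsmoothed estimate).

```python
import math
from collections import Counter, defaultdict

docs = [("spam", "buy cheap pills"), ("spam", "cheap pills online"),
        ("ham", "meeting notes attached"), ("ham", "project meeting today")]

# ClassPrior(): Map emits one record per doc keyed by class; Reduce counts -> Nc, N
Nc = Counter(c for c, _ in docs)
N = sum(Nc.values())

# ConditionalProbability(): Map emits ((class, word), 1); Reduce sums -> Tcw
Tcw = Counter((c, w) for c, text in docs for w in text.split())
Tc = defaultdict(int)                       # total word count per class
for (c, _), t in Tcw.items():
    Tc[c] += t

def classify(text, alpha=1.0):
    """argmax_c log P(c) + sum_w log P(w|c), with add-alpha smoothing (assumed)."""
    vocab = {w for _, w in Tcw}
    scores = {}
    for c in Nc:
        score = math.log(Nc[c] / N)
        for w in text.split():
            score += math.log((Tcw[(c, w)] + alpha) / (Tc[c] + alpha * len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("cheap meeting pills"))
```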
31. MapReduce Mining Summary
32. Taxonomy of MapReduce algorithms

                              One iteration       Multiple iterations          Not good for MapReduce
    Clustering                Canopy              KMeans
    Classification            Naïve Bayes, kNN    Gaussian Mixture             SVM
    Graphs                                        PageRank
    Information Retrieval     Inverted Index      Topic modeling (PLSI, LDA)

    • One-iteration algorithms are perfect fits.
    • Multiple-iteration algorithms are OK fits,
      – but small shared information has to be synchronized across iterations (typically through the file system).
    • Some algorithms are not a good fit for the MapReduce framework:
      – they typically require large shared information with a lot of synchronization;
      – traditional parallel frameworks like MPI are better suited for those.
33. MapReduce for machine learning algorithms
    • The key is to convert the algorithm into summation form (Statistical Query model [Kearns'94]):
      y = Σ f(x), where f(x) corresponds to map() and Σ corresponds to reduce()
    • Naïve Bayes
      – MR job: P(c)
      – MR job: P(w|c)
    • KMeans
      – MR job: split the data into subgroups and compute partial sums in Map(), then sum them up in Reduce()
    Map-Reduce for Machine Learning on Multicore [NIPS'06]
34. Machine learning algorithms using MapReduce
    • Linear Regression: ŷ = (XᵀX)⁻¹Xᵀy, where X ∈ R^{m×n} and y ∈ R^m
      – MR job 1: A = XᵀX = Σ_i xi·xiᵀ
      – MR job 2: b = Xᵀy = Σ_i xi·y(i)
      – Finally, solve Aŷ = b
    • Locally Weighted Linear Regression:
      – MR job 1: A = Σ_i wi·xi·xiᵀ
      – MR job 2: b = Σ_i wi·xi·y(i)
    Map-Reduce for Machine Learning on Multicore [NIPS'06]
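A sketch of the two MR jobs for linear regression, under the assumption that numpy is available and with made-up data: each "mapper" computes partial sums of xi·xiᵀ and xi·y(i) over its split, and the "reducer" adds them up and solves the normal equations.

```python
import numpy as np

def regression_map(X_part, y_part):
    # Partial sums for one split: (sum_i x_i x_i^T, sum_i x_i y_i)
    return X_part.T @ X_part, X_part.T @ y_part

def regression_reduce(partials):
    # Sum the partial (A, b) contributions and solve A w = b
    A = sum(a for a, _ in partials)
    b = sum(b for _, b in partials)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

splits = np.array_split(np.arange(1000), 4)           # four mappers
partials = [regression_map(X[idx], y[idx]) for idx in splits]
print(regression_reduce(partials))                     # ~ [1.0, -2.0, 0.5]
```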
35. Machine learning algorithms using MapReduce (cont.)
    • Logistic Regression
    • Neural Networks
    • PCA
    • ICA
    • EM for Gaussian Mixture Models
    Map-Reduce for Machine Learning on Multicore [NIPS'06]
36. MapReduce Mining Resources
37. Mahout: Hadoop data mining library
    • Mahout: http://lucene.apache.org/mahout/
      – scalable data mining libraries, mostly implemented in Hadoop
    • Data structures for vectors and matrices
      – Vectors
        · dense vectors as double[]
        · sparse vectors as HashMap<Integer, Double>
        · operations: assign, cardinality, copy, divide, dot, get, haveSharedCells, like, minus, normalize, plus, set, size, times, toArray, viewPart, zSum and cross
      – Matrices
        · dense matrix as a double[][]
        · SparseRowMatrix or SparseColumnMatrix as a Vector[], holding the rows or columns of the matrix as SparseVectors
        · SparseMatrix as a HashMap<Integer, Vector>
        · operations: assign, assignColumn, assignRow, cardinality, copy, divide, get, haveSharedCells, like, minus, plus, set, size, times, transpose, toArray, viewPart and zSum
38. MapReduce Mining Papers
    • [Chu et al., NIPS'06] Map-Reduce for Machine Learning on Multicore
      – general framework under MapReduce
    • [Papadimitriou et al., ICDM'08] DisCo: Distributed Co-clustering with Map-Reduce
      – co-clustering
    • [Kang et al., ICDM'09] PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations
      – graph algorithms
    • [Das et al., WWW'07] Google News Personalization: Scalable Online Collaborative Filtering
      – PLSI EM
    • [Grossman and Gu, KDD'08] Data Mining Using High Performance Data Clouds: Experimental Studies Using Sector and Sphere
      – an alternative to Hadoop that supports wide-area data collection and distribution
39. Summary: algorithms
    • Best for MapReduce:
      – single pass, keys uniformly distributed
    • OK for MapReduce:
      – multiple passes, small intermediate state
    • Bad for MapReduce:
      – skewed key distribution
      – fine-grained synchronization required
40. Large-scale Data Mining: MapReduce and Beyond
    Part 2: Algorithms
    Spiros Papadimitriou, IBM Research
    Jimeng Sun, IBM Research
    Rong Yan, Facebook