
Mining of Massive Datasets

Slides from my talk at DDD Dundee 2014 on some approaches used in the mining of massive datasets.


  1. Mining of Massive Datasets. Ashic Mahtab, @ashic, www.heartysoft.com
  2. Stream Processing
  3. Stream Processing • Have I already processed this? • How many distinct queries were made? • How many hits did I get?
  4. Stream Processing – Bloom Filters • A negative answer is guaranteed correct (no false negatives). • False positives are possible.
  5. Stream Processing – Bloom Filters • Keep a collection of hash functions (h1, h2, h3, ...). • For an input, run the hash functions and map each result to a position in a bit array. • If all of those bits are set in the working store, the item might have been processed (false positives are possible). • If any of the hashed bits is not set in the working store, the item has definitely not been seen and needs processing (guaranteed: no false negatives). See the sketch after slide 7.
  6. Stream Processing – Bloom Filters. [Diagram: a bit-array working store.] Input 1: “Foo” hashes to: 1 0 0 1 1 0 0 0 0 0. Input 2: “Bar” hashes to: 1 0 1 1 1 0 0 0 0 0.
  7. Stream Processing – Bloom Filters • Not just for streams (everything is a stream, right?). • Cassandra uses Bloom filters to detect whether some data might be in a low-level storage file.
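
A minimal Python sketch of the idea on slides 4 to 7. The bit-array size, hash count, and the salted-SHA-256 hashing are illustrative choices, not from the talk:

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter: no false negatives, possible false positives."""

        def __init__(self, size=1000, num_hashes=3):
            self.size = size
            self.num_hashes = num_hashes
            self.bits = [False] * size  # the "working store" bit array

        def _positions(self, item):
            # Derive num_hashes positions by salting a single hash function.
            for i in range(self.num_hashes):
                h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(h, 16) % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = True

        def might_contain(self, item):
            # False => definitely never added; True => possibly added.
            return all(self.bits[pos] for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("Foo")
    print(bf.might_contain("Foo"))  # True
    print(bf.might_contain("Bar"))  # almost certainly False
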
  8. Map Reduce • A little smarts goes a l-o-o-o-n-g way.
  9. Map Reduce – Multiway Joins • R join S join T • size(R) = r, size(S) = s, size(T) = t • Probability of a match between R and S = p • Probability of a match between S and T = p • Which do we join first?
  10. Map Reduce – Multiway Joins • R(A, B) join S(B, C) join T(C, D) • size(R) = r, size(S) = s, size(T) = t • Probability of a match between R and S = p; between S and T = p • Communication cost: if we join R and S first, the intermediate result has about prs tuples, giving O(r + s + t + prs); if we join S and T first, O(r + s + t + pst).
  11. Map Reduce – Multiway Joins • Can we do better?
  12. Map Reduce – Multiway Joins • Hash B to b buckets and C to c buckets, with bc = k (k = number of reducers). • Cost ~ r + 2s + t + 2 * sqrt(krt). • Usually r + t can be neglected compared to the k term, so about 2s + 2 * sqrt(krt) [single MR job].
  13. Map Reduce – Multiway Joins • Hash B to b buckets and C to c buckets, with bc = k. • Cost ~ r + 2s + t + 2 * sqrt(krt), usually about 2s + 2 * sqrt(krt) [single MR job] • vs (r + s + t + prs) [two MR jobs].
  14. Map Reduce – Multiway Joins • So... is this always better?
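
A toy numeric comparison of the two strategies. The relation sizes, match probability, and reducer count below are made-up values for illustration:

    from math import sqrt

    def cascade_cost(r, s, t, p):
        # Two MR jobs: join R and S first; the intermediate has ~p*r*s tuples.
        return r + s + t + p * r * s

    def three_way_cost(r, s, t, k):
        # Single MR job with k = b*c reducers: each S tuple is sent once,
        # each R tuple c times, each T tuple b times (with b, c optimised).
        return r + 2 * s + t + 2 * sqrt(k * r * t)

    r = s = t = 1_000_000
    p = 0.001
    k = 100
    print(cascade_cost(r, s, t, p))    # dominated by p*r*s = 1e9
    print(three_way_cost(r, s, t, k))  # ~2.4e7: the three-way join wins here
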
  15. Map Reduce – Complexity • Replication rate (r): number of outputs from all Map tasks / number of inputs. • Reducer size (q): maximum number of inputs a reducer may receive for one key. • p = number of inputs. • For one-pass multiplication of n×n matrices (p = 2n^2 inputs): qr >= 2n^2, i.e. r >= p / q.
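
A quick worked instance of that bound (n and q are arbitrary choices):

    n = 1_000
    p = 2 * n * n          # inputs: the elements of both n x n matrices
    q = 100_000            # chosen reducer size
    r_min = 2 * n * n / q  # lower bound on replication rate: qr >= 2n^2
    print(p, r_min)        # 2,000,000 inputs; each replicated >= 20 times
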
  16. Map Reduce – Matrix Multiplication • Approach 1 (two MR jobs). • Matrices M and N, with elements M(i, j) and N(j, k). • Map 1: emit (j, (M, i, mij)) and (j, (N, k, njk)). • Reduce 1: for each key j, output ((i, k), mij * njk) for every pair of M and N values. • Map 2: identity. • Reduce 2: for each key (i, k), sum the values.
  17. Map Reduce – Matrix Multiplication • Approach 2 (one step). • Map: for each element mij of M, produce ((i, k), (M, j, mij)) for k = 1..(number of columns of N); for each element njk of N, produce ((i, k), (N, j, njk)) for i = 1..(number of rows of M). • Reduce: for each key (i, k), multiply values with matching j and sum.
  18. Map Reduce – Matrix Multiplication • Approach 3: two steps again.
  19. Map Reduce – Matrix Multiplication • Communication cost, one pass: (4n^4) / q. • Two passes: (4n^3) / sqrt(q).
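
A toy in-memory simulation of the one-step approach from slide 17, with a dictionary standing in for the shuffle; the matrix sizes are arbitrary:

    from collections import defaultdict

    def mapreduce_matmul(M, N):
        """One-step MapReduce matrix multiplication (toy, in-memory)."""
        n_rows, n_inner, n_cols = len(M), len(N), len(N[0])

        # Map phase: the key is the output cell (i, k).
        groups = defaultdict(list)
        for i in range(n_rows):
            for j in range(n_inner):
                for k in range(n_cols):
                    groups[(i, k)].append(('M', j, M[i][j]))
        for j in range(n_inner):
            for k in range(n_cols):
                for i in range(n_rows):
                    groups[(i, k)].append(('N', j, N[j][k]))

        # Reduce phase: pair values with matching j, multiply, and sum.
        P = [[0] * n_cols for _ in range(n_rows)]
        for (i, k), values in groups.items():
            m_vals = {j: v for tag, j, v in values if tag == 'M'}
            n_vals = {j: v for tag, j, v in values if tag == 'N'}
            P[i][k] = sum(m_vals[j] * n_vals[j] for j in m_vals)
        return P

    M = [[1, 2], [3, 4]]
    N = [[5, 6], [7, 8]]
    print(mapreduce_matmul(M, N))  # [[19, 22], [43, 50]]
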
  20. Similarity – Shingling • “abcdef” -> [“abc”, “bcd”, “cde”, “def”] • Jaccard similarity -> N(intersection) / N(union)
  21. Similarity – Shingling • “abcdef” -> [“abc”, “bcd”, “cde”, “def”] • Jaccard similarity -> N(intersection) / N(union) • Problem? Size: the shingle sets can be nearly as large as the documents themselves.
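
A small sketch of shingling and Jaccard similarity as described above, with k = 3 to match the slides:

    def shingles(text, k=3):
        """All k-character shingles of a string, as a set."""
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    def jaccard(a, b):
        """Jaccard similarity: |intersection| / |union|."""
        return len(a & b) / len(a | b)

    s1 = shingles("abcdef")
    s2 = shingles("abcdex")
    print(sorted(s1))       # ['abc', 'bcd', 'cde', 'def']
    print(jaccard(s1, s2))  # 3 shared of 5 total -> 0.6
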
  22. Similarity – Minhashing
  23. Similarity – Minhashing • [Example over a characteristic matrix, not shown:] h(S1) = a, h(S2) = c, h(S3) = b, h(S4) = a
  24. Similarity – Minhashing • Problem? h(S1) = a, h(S2) = c, h(S3) = b, h(S4) = a. A single minhash value is only a noisy estimate of similarity.
  25. Similarity – Minhash Signatures
  26. Similarity – Minhash Signatures • Problem? We still can’t find the pairs with greatest similarity efficiently: comparing all pairs of signatures is quadratic.
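
A sketch of minhash signatures. The affine hash family (a*x + b) mod prime and the signature length are conventional choices, not specified in the slides:

    import random

    def minhash_signatures(sets, num_hashes=100, seed=42):
        """Minhash signatures via random affine hashes (a*x + b) % prime.
        Set elements are assumed to be integers below the prime."""
        rng = random.Random(seed)
        prime = 1_000_003
        funcs = [(rng.randrange(1, prime), rng.randrange(prime))
                 for _ in range(num_hashes)]
        return [[min((a * x + b) % prime for x in s) for a, b in funcs]
                for s in sets]

    def estimated_jaccard(sig1, sig2):
        """The fraction of agreeing positions estimates Jaccard similarity."""
        return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

    s1 = {1, 2, 3, 4}
    s2 = {1, 2, 3, 9}
    sig1, sig2 = minhash_signatures([s1, s2])
    print(estimated_jaccard(sig1, sig2))  # close to the true Jaccard of 0.6
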
  27. Similarity – LSH for Minhash Signatures
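
And a sketch of the banding technique behind LSH for minhash signatures; the split into 20 bands of 5 rows is an arbitrary example (bands * rows must equal the signature length):

    from collections import defaultdict

    def lsh_candidate_pairs(signatures, bands=20, rows=5):
        """Split each signature into bands, hash each band to a bucket,
        and report any pair sharing a bucket in some band as candidates."""
        assert all(len(sig) == bands * rows for sig in signatures)
        candidates = set()
        for b in range(bands):
            buckets = defaultdict(list)
            for doc_id, sig in enumerate(signatures):
                band = tuple(sig[b * rows:(b + 1) * rows])
                buckets[band].append(doc_id)
            for ids in buckets.values():
                for i in range(len(ids)):
                    for j in range(i + 1, len(ids)):
                        candidates.add((ids[i], ids[j]))
        return candidates

    sigs = [
        [1, 2, 3, 4] * 25,  # doc 0
        [1, 2, 3, 4] * 25,  # doc 1: identical, must become a candidate
        [5, 6, 7, 8] * 25,  # doc 2: never shares a band with the others
    ]
    print(lsh_candidate_pairs(sigs))  # {(0, 1)}

Tuning bands and rows trades false positives against false negatives: more bands with fewer rows catches pairs of lower similarity.
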
  28. Clustering – Hierarchical
  29. Clustering – K Means 1. Pick k points as initial centroids. 2. Assign each point to the nearest centroid. 3. Shift each centroid to the “centre” (mean) of its cluster. 4. Repeat from step 2 until assignments stabilise.
  30. Clustering – K Means
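
A plain k-means sketch following the four steps on slide 29; the sample points are invented:

    import random

    def dist2(p, q):
        """Squared Euclidean distance between two points."""
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def mean(cluster):
        """Centroid (per-dimension mean) of a non-empty cluster."""
        n = len(cluster)
        return tuple(sum(c[d] for c in cluster) / n
                     for d in range(len(cluster[0])))

    def kmeans(points, k, iters=100, seed=0):
        """Pick k centroids, assign points, re-centre, repeat."""
        rng = random.Random(seed)
        centroids = rng.sample(points, k)
        clusters = []
        for _ in range(iters):
            # Step 2: assign each point to its nearest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
                clusters[nearest].append(p)
            # Step 3: shift each centroid to the centre of its cluster.
            new_centroids = [mean(c) if c else centroids[i]
                             for i, c in enumerate(clusters)]
            if new_centroids == centroids:
                break
            centroids = new_centroids
        return centroids, clusters

    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, clusters = kmeans(points, k=2)
    print(centroids)  # near (0.33, 0.33) and (10.33, 10.33)
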
  31. Clustering – BFR • Three sets of points: Discard, Compressed, and Retained. • The first two are kept only as summaries: N, SUM per dimension, and SUMSQ (sum of squares) per dimension. • Assumes a high-dimensional Euclidean space; points are assigned to clusters using the Mahalanobis distance.
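
A sketch of how the BFR summaries (N, SUM, SUMSQ) support a Mahalanobis-distance test, assuming independent dimensions; the example cluster is invented:

    from math import sqrt

    def mahalanobis(point, N, SUM, SUMSQ):
        """Mahalanobis distance of a point from a cluster summarised as
        (N, SUM per dimension, SUMSQ per dimension)."""
        total = 0.0
        for d, x in enumerate(point):
            mean = SUM[d] / N
            variance = SUMSQ[d] / N - mean ** 2
            sd = sqrt(variance) if variance > 0 else 1e-9
            total += ((x - mean) / sd) ** 2
        return sqrt(total)

    # Cluster summary of three 2-D points: (1,1), (2,2), (3,3).
    N, SUM, SUMSQ = 3, [6.0, 6.0], [14.0, 14.0]
    print(mahalanobis((2.5, 2.5), N, SUM, SUMSQ))    # small: add to Discard
    print(mahalanobis((10.0, 10.0), N, SUM, SUMSQ))  # large: keep in Retained
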
  32. Clustering – CURE
  33. Clustering – CURE • Take a sample and run a clustering algorithm on it. • Pick “representative” points from each cluster. • Move the representatives about 20% of the way towards the centre. • Merge clusters whose representatives are close.
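
A sketch of CURE’s “shrink the representatives towards the centre” step; alpha = 0.2 reflects the roughly 20% mentioned on the slide:

    def shrink_representatives(reps, centroid, alpha=0.2):
        """Move each CURE representative a fraction alpha of the way
        towards the cluster centroid."""
        return [tuple(r[d] + alpha * (centroid[d] - r[d])
                      for d in range(len(r)))
                for r in reps]

    reps = [(0.0, 0.0), (4.0, 0.0)]
    centroid = (2.0, 0.0)
    print(shrink_representatives(reps, centroid))  # [(0.4, 0.0), (3.6, 0.0)]
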
  34. Dimensionality Reduction
  35. Dimensionality Reduction
  36. Dimensionality Reduction – SVD
  37. Dimensionality Reduction – SVD
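
A minimal numpy sketch of SVD-based rank reduction; the utility matrix and the rank k = 2 are illustrative:

    import numpy as np

    # A small "utility" matrix (rows: users, columns: items).
    M = np.array([[1, 1, 1, 0, 0],
                  [3, 3, 3, 0, 0],
                  [4, 4, 4, 0, 0],
                  [5, 5, 5, 0, 0],
                  [0, 0, 0, 4, 4],
                  [0, 0, 0, 5, 5],
                  [0, 0, 0, 2, 2]], dtype=float)

    U, s, Vt = np.linalg.svd(M, full_matrices=False)

    # Keep only the top-2 singular values: a rank-2 approximation of M.
    k = 2
    M_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print(np.round(M_approx, 2))  # close to M: two "concepts" explain the data
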
  38. Dimensionality Reduction – CUR • SVD produces dense U and V, even when M is sparse. • Computing the SVD costs O(n^3).
  39. Dimensionality Reduction – CUR • Choose r. • Choose r rows and r columns of M; their intersection is W. • Run SVD on W (much smaller than M): W = XΣY’. • Compute Σ+, the Moore-Penrose pseudoinverse of Σ. • Then U = Y * (Σ+)^2 * X’. (A sketch follows slide 43.)
  40. Dimensionality Reduction – CUR: Choosing Rows and Columns • Pick at random, but biased towards importance, measured by contribution to the (Frobenius norm)^2. • Probability of picking a row or column = (sum of squares of its elements) / (sum of squares of all elements).
  41. Dimensionality Reduction – CUR: Choosing Rows and Columns • The same row or column may get picked (selection with replacement). • This reduces the rank.
  42. Dimensionality Reduction – CUR: Choosing Rows and Columns • The same row or column may get picked (selection with replacement). • This reduces the rank. • Duplicates can be combined: keep a single copy and multiply it by sqrt(k) if it appears k times.
  43. Dimensionality Reduction – CUR: Choosing Rows and Columns • The same row or column may get picked (selection with replacement). • This reduces the rank. • Duplicates can be combined: keep a single copy and multiply it by sqrt(k) if it appears k times. • Compute the pseudo-inverse as before, but transpose the result.
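
A toy numpy sketch of CUR as outlined on slides 39 to 43. It samples with replacement and does not merge duplicate picks, and the matrix is the same illustrative one used above:

    import numpy as np

    def cur_decomposition(M, r, seed=0):
        """CUR sketch: pick r columns and r rows with probability proportional
        to their squared Frobenius-norm contribution, then build U from the
        SVD of the intersection W (duplicates not merged in this toy)."""
        rng = np.random.default_rng(seed)
        total = (M ** 2).sum()
        col_p = (M ** 2).sum(axis=0) / total
        row_p = (M ** 2).sum(axis=1) / total

        cols = rng.choice(M.shape[1], size=r, p=col_p)
        rows = rng.choice(M.shape[0], size=r, p=row_p)

        # Scale each selected column/row by 1 / sqrt(r * its probability).
        C = M[:, cols] / np.sqrt(r * col_p[cols])
        R = M[rows, :] / np.sqrt(r * row_p[rows])[:, None]

        # W is the intersection of the chosen rows and columns;
        # with W = X Sigma Y', take U = Y (Sigma+)^2 X'.
        W = M[np.ix_(rows, cols)]
        X, sigma, Yt = np.linalg.svd(W)
        sigma_plus = np.where(sigma > 1e-12, 1 / sigma, 0.0)
        U = Yt.T @ np.diag(sigma_plus ** 2) @ X.T
        return C, U, R

    M = np.array([[1, 1, 1, 0, 0],
                  [3, 3, 3, 0, 0],
                  [4, 4, 4, 0, 0],
                  [5, 5, 5, 0, 0],
                  [0, 0, 0, 4, 4],
                  [0, 0, 0, 5, 5],
                  [0, 0, 0, 2, 2]], dtype=float)
    C, U, R = cur_decomposition(M, r=2)
    print(np.round(C @ U @ R, 2))  # approximates M when the sample is good
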
  45. Thanks • Mining of Massive Datasets, by Leskovec, Rajaraman, and Ullman. • Coursera / Stanford course. • Book: http://www.mmds.org/ [free]
