CMU Lecture on Hadoop Performance

A lecture describing several ways to make Hadoop programs go faster.

Agenda
  • What is performance? Optimization?
  • Case 1: Aggregation
  • Case 2: Recommendations
  • Case 3: Clustering
  • Case 4: Matrix decomposition

What is Performance?
  • Is doing something faster better?
  • Is it the right task?
  • Do you have a wide enough view?
  • What is the right performance metric?

Aggregation
  • Word-count and friends
    – How many times did X occur?
    – How many unique X’s occurred?
  • Associative metrics permit decomposition
    – Partial sums and grand totals, for example
    – Use combiners
    – Use high-resolution aggregates to compute low-resolution aggregates
  • Rank-based statistics do not permit decomposition
    – Avoid them
    – Use approximations

Inside Map-Reduce
[Diagram: word count as a map-reduce pipeline. The input text (“The time has come,” the Walrus said, “To talk of many things: Of shoes—and ships—and sealing-wax …”) flows through Map, which emits pairs like (the, 1), (time, 1), (has, 1), (come, 1); Combine and the shuffle/sort group these into (come, [3,2,1]), (has, [1,5,2]), (the, [1,2,1]), (time, [10,1,3]); Reduce emits the totals (come, 6), (has, 8), (the, 4), (time, 14). Stages: Input → Map → Combine → Shuffle and sort → Reduce → Output.]

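The combiner step is what makes the associative-metric rule pay off. Below is a minimal pure-Python simulation of these stages (not Hadoop API code); it shows why mapper-side partial sums commute with the final reduce:

```python
import re
from collections import defaultdict
from itertools import groupby

def map_phase(lines):
    # Map: emit (word, 1) for every word
    for line in lines:
        for word in re.findall(r"[a-z]+", line.lower()):
            yield word, 1

def combine(pairs):
    # Combine: mapper-side partial sums; legal because addition is associative
    sums = defaultdict(int)
    for word, n in pairs:
        sums[word] += n
    return sums.items()

def reduce_phase(pairs):
    # Shuffle/sort, then reduce partial sums to grand totals
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(n for _, n in group)

lines = ['"The time has come," the Walrus said,',
         '"To talk of many things:',
         'Of shoes, and ships, and sealing-wax']
print(dict(reduce_phase(combine(map_phase(lines)))))
```
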
Don’t Do This
[Diagram: the raw data is scanned three separate times, once each to build the Daily, Weekly, and Monthly aggregates.]

Do This Instead
[Diagram: the raw data is scanned once to build Daily aggregates; Weekly aggregates are built from Daily, and Monthly aggregates from the shorter-term ones.]

Aggregation
  • First rule:
    – Don’t read the big input multiple times
    – Compute longer-term aggregates from short-term aggregates
  • Second rule:
    – Don’t read the big input multiple times
    – Compute multiple windowed aggregates at the same time

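As a sketch of the cascading rule, assuming daily counts keyed by date have already been produced by a single scan of the raw data (the field names here are illustrative, not from the talk):

```python
from collections import defaultdict
from datetime import date

# daily[(year, month, day)] -> count, produced by the one raw-data scan
daily = {(2012, 10, d): 100 + d for d in range(1, 32)}

weekly = defaultdict(int)
monthly = defaultdict(int)
for (y, m, d), count in daily.items():
    iso_year, iso_week, _ = date(y, m, d).isocalendar()
    weekly[(iso_year, iso_week)] += count   # weekly built from daily
    monthly[(y, m)] += count                # monthly built from daily

print(monthly[(2012, 10)])  # no rescan of the raw events needed
```
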
Rank Statistics Can Be Tamed
  • Approximate quartiles are easily computed
    – (but sorted data is evil)
  • Approximate unique counts are easily computed
    – use a Bloom filter and extrapolate from the number of set bits
    – use multiple filters at different down-sample rates
  • High and low quantiles are easily approximated
    – keep the largest 1000 elements
    – keep the largest 1000 elements from 10x down-sampled data
    – and so on
  • Approximate top-40 is also possible

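The Bloom-filter extrapolation works because the expected number of set bits is a known function of the distinct-key count: t ≈ M(1 − e^(−Kn/M)), which inverts to n ≈ −(M/K) ln(1 − t/M). A small sketch (the hash construction is illustrative only):

```python
import hashlib
import math

M, K = 2 ** 20, 4   # filter size in bits, hash functions per key

def bit_positions(key):
    h = hashlib.md5(key.encode()).digest()
    return [int.from_bytes(h[4 * i:4 * i + 4], "big") % M for i in range(K)]

set_bits = set()
for i in range(50_000):                  # 50k insertions, 30k distinct keys
    for b in bit_positions(f"user-{i % 30_000}"):
        set_bits.add(b)

# Invert E[set bits] = M * (1 - exp(-K*n/M)) to estimate n
t = len(set_bits)
estimate = -M / K * math.log(1 - t / M)
print(f"estimated distinct keys: {estimate:.0f} (true: 30000)")
```
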
Recommendations
  • Common patterns in the past may predict common patterns in the future
  • People who bought item x also bought item y
  • But also, people who bought Chinese food in the past, …
  • Or people in SoMa really liked this restaurant in the past

People who bought …
  • Key operation is counting the number of people who bought x and y
    – for all x’s and all y’s
  • The raw problem appears to be O(N³)
  • At the least, O(k_max²)
    – for the most prolific user, there are k² pairs to count
    – k_max can be near N
  • Scalable problems must be O(N)

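A sketch of the core pair-counting step (per-user item lists are assumed to be already sessionized); the per-user pair enumeration makes the O(k_max²) term visible:

```python
from collections import Counter
from itertools import combinations

def cooccurrence(user_items):
    # user_items: {user: [items bought]}
    counts = Counter()
    for items in user_items.values():
        # k items for one user -> k*(k-1)/2 pairs: this is the
        # O(k_max^2) term that downsampling keeps bounded
        for x, y in combinations(sorted(set(items)), 2):
            counts[(x, y)] += 1
    return counts

print(cooccurrence({"alice": ["tea", "scones"],
                    "bob": ["tea", "scones", "jam"]}))
```
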
But …
  • What do we learn from users who buy everything?
    – they have no discrimination
    – they are often the QA team
    – they tell us nothing
  • What do we learn from items bought by everybody?
    – the dual of omnivorous buyers
    – these are often teaser items
    – they tell us nothing

Also …
  • What would you learn about a user from purchases
    – 1 … 20?
    – 21 … 100?
    – 101 … 1000?
    – 1001 … ∞?
  • What about learning about an item?
    – how many people do we need to see before we understand the item?

So …
  • Cheat!
  • Downsample every user to at most 1000 interactions
    – most recent
    – most rare
    – random selection
    – whatever is easiest
  • Now k_max ≤ 1000

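A one-pass sketch of the random-selection variant, using reservoir sampling so that even an unbounded history downsamples in a single pass:

```python
import random

def downsample(interactions, limit=1000, seed=42):
    # Reservoir sampling: keep a uniform random subset of at most
    # `limit` interactions, seen exactly once each
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(interactions):
        if i < limit:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)
            if j < limit:
                reservoir[j] = item
    return reservoir

print(len(downsample(range(1_000_000))))  # 1000, so k_max <= 1000
```
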
The Fundamental Things Apply
  • Don’t read the raw data repeatedly
  • Sessionize and denormalize per hour/day/week
    – that is, group by user
    – expand items with categories and content descriptors if feasible
  • Feed all down-stream processing in one pass
    – baby join to item characteristics
    – downsample
    – count grand totals
    – compute cooccurrences

Deployment Matters, Too
  • For the restaurant case, basic recommendation info includes:
    – user × merchant histories
    – user × cuisine histories
    – top local restaurants by anomalous repeat visits
    – restaurant × indicator-merchant cooccurrence matrix
    – restaurant × indicator-cuisine cooccurrence matrix
  • These can all be stored and accessed using text retrieval techniques
  • Fast deployment using mirrors and NFS (not standard Hadoop)

Non-Traditional Deployment Demo
[DEMO]

EM Algorithms
  • Start with random model estimates
  • Use model estimates to classify examples
  • Use classified examples to find maximum-likelihood estimates
  • Use model estimates to classify examples
  • Use classified examples to find maximum-likelihood estimates
  • … and so on …

K-means as EM Algorithm
  • Assign a random seed to each cluster
  • Assign points to the nearest cluster
  • Move each cluster to the average of its contained points
  • Assign points to the nearest cluster
  • … and so on …

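For reference, a compact numpy sketch of this Lloyd iteration (synthetic data and a fixed iteration count, for brevity):

```python
import numpy as np

def lloyd(points, k, iterations=20, seed=0):
    rng = np.random.default_rng(seed)
    # Assign a random seed point to each cluster
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iterations):
        # Assign points to the nearest cluster
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each cluster to the average of its contained points
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

pts = np.random.default_rng(1).normal(size=(500, 2))
centers, labels = lloyd(pts, k=5)
print(centers)
```
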
K-means as Map-Reduce
  • Assignment of points to clusters is trivially parallel
  • Computation of new clusters is also parallel
  • Moving points to averages is ideal for map-reduce

But …
  • With map-reduce, iteration is evil
  • Starting a program can take 10-30 seconds
  • Saving data to disk and then immediately reading it back is silly
  • The input might even fit in cluster memory

Fix #1
  • Don’t do that!
  • Use Spark
    – in-memory, interactive map-reduce
    – 100x to 1000x faster
    – must fit in memory
  • Use Giraph
    – BSP programming model rather than map-reduce
    – essentially map-reduce-reduce-reduce…
  • Use GraphLab
    – like BSP without the speed brakes
    – 100x faster

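A minimal PySpark sketch of why caching fixes the iteration problem (the file name and cluster size are placeholders): the parsed points stay in memory, so each k-means-style pass avoids the write-then-reread cycle plain map-reduce would pay:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "iteration-demo")

# "points.txt" is a placeholder: one whitespace-separated vector per line
points = (sc.textFile("points.txt")
            .map(lambda line: [float(v) for v in line.split()])
            .cache())  # keep parsed points in memory across iterations

centers = points.takeSample(False, 5, seed=42)
for _ in range(10):
    # Each pass reuses the cached RDD; nothing is rewritten to disk
    def nearest(p, cs=centers):
        return min(range(len(cs)),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(p, cs[j])))
    totals = (points.map(lambda p: (nearest(p), (p, 1)))
                    .reduceByKey(lambda a, b: ([x + y for x, y in zip(a[0], b[0])],
                                               a[1] + b[1]))
                    .collect())
    centers = [[s / n for s in vec] for _, (vec, n) in totals]

print(centers)
sc.stop()
```
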
Fix #2
  • Use a sketch-based algorithm
  • Do one pass over the data to compute a sketch of the data
  • Cluster the sketch
  • Done. With good theoretical bounds on accuracy
  • Speedup of 3000x or more

An Example
[Image-only slide: the spiral example introduced on the next slide]

The Problem
  • Spirals are a classic “counter” example for k-means
  • Classic low-dimensional manifold with added noise
  • But clustering still makes modeling work well

An Example
[Image-only slide from the spiral clustering example]

An Example
[Image-only slide from the spiral clustering example]

The Cluster Proximity Features
  • Every point can be described by the nearest cluster
    – 4.3 bits per point in this case
    – Significant error that can be decreased (to a point) by increasing the number of clusters
  • Or by the proximity to the 2 nearest clusters (2 × 4.3 bits + 1 sign bit + 2 proximities)
    – Error is negligible
    – Unwinds the data into a simple representation

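A small numpy sketch of the two-nearest-clusters representation (with 20 clusters, each cluster id costs log₂ 20 ≈ 4.3 bits, matching the figure above; the data here is synthetic):

```python
import numpy as np

def proximity_features(points, centers):
    # Describe each point by its two nearest clusters:
    # (id1, id2, distance1, distance2); ids become floats in the stack,
    # which is fine for illustration
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest2 = np.argsort(d, axis=1)[:, :2]
    rows = np.arange(len(points))
    return np.column_stack([nearest2,
                            d[rows, nearest2[:, 0]],
                            d[rows, nearest2[:, 1]]])

rng = np.random.default_rng(0)
pts = rng.normal(size=(1000, 2))
centers = rng.normal(size=(20, 2))       # 20 clusters ~ 4.3 bits per id
print(proximity_features(pts, centers)[:3])
```
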
Lots of Clusters Are Fine
[Image-only slide]

Surrogate Method
  • Start with sloppy clustering into κ = k log n clusters
  • Use this sketch as a weighted surrogate for the data
  • Cluster the surrogate data using ball k-means
  • Results are provably good for highly clusterable data
  • Sloppy clustering is on-line
  • The surrogate can be kept in memory
  • The ball k-means pass can be done at any time

Algorithm Costs
  • O(k d) per point per iteration for Lloyd’s algorithm
  • Number of iterations not well known
  • Iterations > log n is a reasonable assumption
    – giving O(k d log n) per point overall

Algorithm Costs
  • Surrogate methods
    – fast, sloppy, single-pass clustering with κ = k log n
    – fast sloppy search for the nearest cluster: O(d log κ) = O(d (log k + log log n)) per point
    – fast, in-memory, high-quality clustering of the κ weighted centroids
      · O(κ k d + k³ d) = O(k² d log n + k³ d) for small k, high quality
      · O(κ d log k) or O(d log κ log k) for larger k, looser quality
    – result is k high-quality centroids
      · Even the sloppy clusters may suffice

Algorithm Costs
  • How much faster for the sketch phase?
    – take k = 2000, d = 10, n = 100,000
    – Lloyd’s: k d log n = 2000 × 10 × 26 ≈ 500,000 operations per point
    – sketch: d (log k + log log n) = 10 × (11 + 5) = 160 operations per point
    – roughly 3,000 times faster is a bona fide big deal

Pragmatics
  • But this requires a fast search internally
  • Have to cluster on the fly for the sketch
  • Have to guarantee sketch quality
  • Previous methods had very high complexity

How It Works
  • For each point
    – Find the approximately nearest centroid (distance = d)
    – If (d > threshold) new centroid
    – Else if (u > d/threshold) new cluster
    – Else add to the nearest centroid
  • If centroids > κ ≈ C log N
    – Recursively cluster centroids with a higher threshold
  • Result is a large set of centroids
    – these provide an approximation of the original distribution
    – we can cluster the centroids to get a close approximation of clustering the original data
    – or we can just use the result directly

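A simplified single-pass sketch of this loop. One common variant, used here, spawns a new centroid with probability proportional to d/threshold; the recursive re-clustering step is reduced to just raising the threshold, so this is a sketch of the idea rather than the full algorithm:

```python
import random
import numpy as np

def streaming_sketch(points, kappa, seed=0):
    # One pass: keep a weighted set of centroids as a sketch of the data
    rng = random.Random(seed)
    centroids, weights = [points[0].copy()], [1.0]
    threshold = 1e-3
    for p in points[1:]:
        dists = [np.linalg.norm(p - c) for c in centroids]
        i = int(np.argmin(dists))           # approximately nearest centroid
        d = dists[i]
        # Far points, or near points with probability ~ d/threshold,
        # become new centroids; otherwise fold into the nearest one
        if d > threshold or rng.random() < d / threshold:
            centroids.append(p.copy())
            weights.append(1.0)
        else:
            w = weights[i]
            centroids[i] = (w * centroids[i] + p) / (w + 1)
            weights[i] = w + 1
        if len(centroids) > kappa:          # too many centroids: raise the
            threshold *= 1.5                # threshold (re-clustering omitted)
    return np.array(centroids), np.array(weights)

pts = np.random.default_rng(2).normal(size=(10_000, 2))
cents, w = streaming_sketch(pts, kappa=200)
print(len(cents), w.sum())
```
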
Matrix Decomposition
  • Many big matrices can often be compressed
[Diagram: a large matrix approximated as the product of two much smaller factors]
  • Often used in recommendations

Nearest Neighbor
  • Very high-dimensional vectors can be compressed to 10-100 dimensions with little loss of accuracy
  • Fast search algorithms work up to dimension 50-100; they don’t work above that

Random Projections
  • Many problems in high dimension can be reduced to low dimension
  • Reductions with good distance approximation are available
  • Surprisingly, these methods can be done using random vectors

Fundamental Trick
  • Random orthogonal projection preserves the action of A:
      ||Ax − Ay|| ≈ ||QᵀAx − QᵀAy||

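A quick numpy check of the trick (Q is an orthonormalized random Gaussian matrix; the dimensions and the √(n/k) rescaling, which accounts for keeping only k of n dimensions, are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 1000))
x, y = rng.normal(size=1000), rng.normal(size=1000)

# Q: random Gaussian matrix with orthonormalized columns (1000 -> 100 dims)
Q, _ = np.linalg.qr(rng.normal(size=(1000, 100)))

full = np.linalg.norm(A @ x - A @ y)
proj = np.linalg.norm(Q.T @ (A @ x) - Q.T @ (A @ y))

# A random 100-of-1000-dimensional projection shrinks norms by about
# sqrt(100/1000); rescale before comparing
print(full, proj * np.sqrt(1000 / 100))
```
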
Projection Search
[Figure: points projected onto a single direction yield a total ordering!]

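In a sketch, projection search projects every vector onto one random direction, sorts once to get that total ordering, and then examines only the candidates whose projections bracket the query's (the probe width is an arbitrary illustrative parameter):

```python
import bisect
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=(100_000, 50))
u = rng.normal(size=50)
u /= np.linalg.norm(u)              # one random projection direction

proj = data @ u
order = np.argsort(proj)            # the total ordering along u
sorted_proj = proj[order]

def search(query, probe=200):
    # Examine only the `probe` points whose projections bracket the query's
    pos = bisect.bisect(sorted_proj, query @ u)
    lo, hi = max(0, pos - probe // 2), pos + probe // 2
    candidates = order[lo:hi]
    best = candidates[np.linalg.norm(data[candidates] - query, axis=1).argmin()]
    return best

q = rng.normal(size=50)
print(search(q))   # index of an approximate nearest neighbor
```
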
LSH Bit-match Versus Cosine
[Plot: cosine similarity (y axis, −1 to 1) as a function of the number of matching LSH bits (x axis, 0 to 64).]

But How?
      Y = AΩ             (sample the range of A with a random matrix Ω)
      Q₁R = Y            (thin QR decomposition)
      B = Q₁ᵀA
      LQ₂ᵀ = B           (LQ decomposition)
      USVᵀ = L           (small, in-memory SVD)
      (Q₁U) S (Q₂V)ᵀ ≈ A

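Read top to bottom, these steps are a randomized SVD. A numpy sketch of the same recipe (matrix sizes and target rank are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2000, 40)) @ rng.normal(size=(40, 300))  # rank-40 matrix

k, p = 40, 10                       # target rank plus a little oversampling
Omega = rng.normal(size=(A.shape[1], k + p))

Y = A @ Omega                       # Y = A Omega: sample the range of A
Q1, _ = np.linalg.qr(Y)             # Q1 R = Y
B = Q1.T @ A                        # B = Q1' A, now only (k+p) x 300
Q2, R = np.linalg.qr(B.T)           # B = L Q2' with L = R'
L = R.T
U, s, Vt = np.linalg.svd(L)         # U S V' = L, small and in-memory
approx = (Q1 @ U) @ np.diag(s) @ (Q2 @ Vt.T).T   # (Q1 U) S (Q2 V)' ~ A

print(np.linalg.norm(A - approx) / np.linalg.norm(A))  # essentially zero
```
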
Summary
  • Don’t repeat big scans
    – Cascade aggregations
    – Compute several aggregates at once
  • Use approximate measures for rank statistics
  • Downsample where appropriate
  • Use non-traditional deployment
  • Use sketches
  • Use random projections

Contact Me!
  • We’re hiring at MapR in the US and Europe
  • Come get the slides at http://www.mapr.com/company/events/cmu-hadoop-performance-11-1-12
  • Get the code at https://github.com/tdunning
  • Contact me at tdunning@maprtech.com or @ted_dunning