CMU Lecture on Hadoop Performance

A lecture by Ted Dunning describing several ways to make Hadoop programs go faster.


  1. Hadoop Performance
  2. Agenda
     - What is performance? Optimization?
     - Case 1: Aggregation
     - Case 2: Recommendations
     - Case 3: Clustering
     - Case 4: Matrix decomposition
  3. What is Performance?
     - Is doing something faster better?
     - Is it the right task?
     - Do you have a wide enough view?
     - What is the right performance metric?
  4. Aggregation
     - Word-count and friends: how many times did X occur? How many unique X's occurred?
     - Associative metrics permit decomposition: partial sums and grand totals, for example. Use combiners, and use high-resolution aggregates to compute low-resolution aggregates.
     - Rank-based statistics do not permit decomposition: avoid them, or use approximations.
  5. Inside Map-Reduce
     - Pipeline: Input → Map → Combine → Shuffle and sort → Reduce → Output
     - Example input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax"
     - Map output: (the, 1), (time, 1), (has, 1), (come, 1), …
     - After shuffle and sort: (come, [3, 2, 1]), (has, [1, 5, 2]), (the, [1, 2, 1]), (time, [10, 1, 3]), …
     - Reduce output: (come, 6), (has, 8), (the, 4), (time, 14), …
     - A runnable word-count sketch follows below.
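To make the combiner point concrete, here is a minimal, pure-Python simulation of the pipeline above (illustrative only, not Hadoop API code): the combiner computes per-task partial sums, and the reducer only adds partial sums together, which is exactly the decomposition that associative metrics allow.

```python
import re
from collections import Counter, defaultdict

def map_and_combine(split):
    # Map: emit (word, 1) for each word; Combine: partial sums per map task.
    return Counter(re.findall(r"[a-z]+", split.lower()))

def word_count(splits):
    # Shuffle and sort: group the partial sums by word across map tasks.
    grouped = defaultdict(list)
    for split in splits:
        for word, partial in map_and_combine(split).items():
            grouped[word].append(partial)
    # Reduce: a grand total is just the sum of the partial sums.
    return {word: sum(partials) for word, partials in grouped.items()}

splits = ['"The time has come," the Walrus said,',
          '"To talk of many things:',
          'Of shoes, and ships, and sealing-wax']
print(word_count(splits))  # {'the': 2, 'time': 1, 'has': 1, ...}
```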
  6. Don't Do This
     [Diagram: the raw data is scanned separately to build each of the daily, weekly, and monthly aggregates]
  7. Do This Instead
     [Diagram: the raw data is scanned once to build daily aggregates; weekly and monthly aggregates cascade from the daily ones]
  8. Aggregation
     - First rule: don't read the big input multiple times. Compute longer-term aggregates from short-term aggregates.
     - Second rule: don't read the big input multiple times. Compute multiple windowed aggregates at the same time.
     - Both rules appear in the sketch below.
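A sketch of both rules at once, assuming hypothetical (day, value) event records: the big input is read exactly once to build daily sums, and the weekly and monthly totals roll up from the small daily aggregate instead of rescanning the raw data.

```python
from collections import defaultdict

def aggregate(events):
    # events: iterable of (day_index, value) records
    daily = defaultdict(float)
    for day, value in events:        # the only pass over the big input
        daily[day] += value
    weekly = defaultdict(float)
    monthly = defaultdict(float)
    for day, total in daily.items(): # everything from here on is small data
        weekly[day // 7] += total
        monthly[day // 30] += total
    return daily, weekly, monthly

daily, weekly, monthly = aggregate([(0, 1.0), (3, 2.0), (8, 4.0), (40, 8.0)])
print(dict(weekly))   # {0: 3.0, 1: 4.0, 5: 8.0}
print(dict(monthly))  # {0: 7.0, 1: 8.0}
```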
  9. Rank Statistics Can Be Tamed
     - Approximate quartiles are easily computed (but sorted data is evil)
     - Approximate unique counts are easily computed: use a Bloom filter and extrapolate from the number of set bits, or use multiple filters at different down-sample rates (see the sketch below)
     - Approximate extreme (high or low) quantiles are easily computed: keep the largest 1000 elements, keep the largest 1000 elements from 10x down-sampled data, and so on
     - Approximate top-40 is also possible
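Here is a minimal sketch of the unique-count idea, simplified to a single hash function so the Bloom filter degenerates into a linear-counting sketch: with m bits and n distinct keys, the expected zero-bit fraction is about e^(-n/m), so n can be extrapolated from the set bits. The bit width and key format are illustrative choices, not anything from the talk.

```python
import math
import hashlib

M = 1 << 16  # 65,536 bits (illustrative size)

def estimate_uniques(keys, m=M):
    bits = bytearray(m)  # one byte per bit, for clarity
    for key in keys:
        h = int(hashlib.md5(key.encode()).hexdigest(), 16) % m
        bits[h] = 1
    # Extrapolate: zero fraction ~ e^(-n/m), so n ~ -m * ln(zero fraction).
    # (Assumes the filter is not saturated, i.e. some zero bits remain.)
    zero_fraction = bits.count(0) / m
    return -m * math.log(zero_fraction)

keys = [f"user-{i % 40000}" for i in range(200000)]  # 40,000 distinct keys
print(round(estimate_uniques(keys)))                 # close to 40000
```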
  10. Recommendations
     - Common patterns in the past may predict common patterns in the future
     - People who bought item x also bought item y
     - But also: people who bought Chinese food in the past, …
     - Or: people in SoMa really liked this restaurant in the past
  11. People who bought …
     - The key operation is counting the number of people who bought both x and y, for all x's and all y's
     - The raw problem appears to be O(N^3)
     - At the least it is O(k_max^2): the most prolific user contributes k^2 pairs to count, and k_max can be near N (see the pair-counting sketch below)
     - Scalable problems must be O(N)
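A sketch of the cooccurrence count, grouped by user: each user with k items contributes k(k-1)/2 pairs, which is why one very prolific user can dominate the cost (the O(k_max^2) term above). The basket data is hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(baskets):
    # baskets: one item list per user
    counts = Counter()
    for items in baskets:
        # every unordered pair within a user's basket is counted once
        for x, y in combinations(sorted(set(items)), 2):
            counts[(x, y)] += 1
    return counts

print(cooccurrence([["tea", "milk"], ["tea", "milk", "honey"]]))
# Counter({('milk', 'tea'): 2, ('honey', 'milk'): 1, ('honey', 'tea'): 1})
```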
  12. But …
     - What do we learn from users who buy everything? They have no discrimination, they are often the QA team, and they tell us nothing.
     - What do we learn from items bought by everybody? This is the dual of the omnivorous buyers: these are often teaser items, and they tell us nothing.
  13. Also …
     - What would you learn about a user from purchases 1 … 20? 21 … 100? 101 … 1000? 1001 … ∞?
     - What about learning about an item? How many people do we need to see before we understand the item?
  14. So …
     - Cheat!
     - Downsample every user to at most 1000 interactions: most recent, most rare, random selection, whatever is easiest (a reservoir-sampling sketch follows below)
     - Now k_max ≤ 1000
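One easy way to implement the "random selection" option is reservoir sampling, sketched here: every interaction survives with equal probability, and each user's history is capped so k_max, and hence the pair-counting cost, is bounded.

```python
import random

LIMIT = 1000  # cap on interactions kept per user

def downsample(interactions):
    # interactions: one user's history, arbitrarily long
    reservoir = []
    for i, item in enumerate(interactions):
        if i < LIMIT:
            reservoir.append(item)
        else:
            j = random.randrange(i + 1)
            if j < LIMIT:
                reservoir[j] = item  # each item survives with prob LIMIT/(i+1)
    return reservoir

print(len(downsample(range(50000))))  # 1000
```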
  15. The Fundamental Things Apply
     - Don't read the raw data repeatedly
     - Sessionize and denormalize per hour/day/week: that is, group by user, and expand items with categories and content descriptors if feasible
     - Feed all down-stream processing in one pass: baby join to item characteristics, downsample, count grand totals, compute cooccurrences
  16. Deployment Matters, Too
     - For the restaurant case, basic recommendation info includes:
       - user x merchant histories
       - user x cuisine histories
       - top local restaurants by anomalous repeat visits
       - the restaurant x indicator-merchant cooccurrence matrix
       - the restaurant x indicator-cuisine cooccurrence matrix
     - These can all be stored and accessed using text retrieval techniques
     - Fast deployment using mirrors and NFS (not standard Hadoop)
  17. Non-Traditional Deployment (demo)
  18. EM Algorithms
     - Start with random model estimates
     - Use the model estimates to classify examples
     - Use the classified examples to find maximum-likelihood model estimates
     - Use the new model estimates to classify examples, use the classified examples to re-estimate the model …
     - … and so on …
  19. K-means as an EM Algorithm
     - Assign a random seed to each cluster
     - Assign points to the nearest cluster
     - Move each cluster to the average of its contained points
     - Assign points to the nearest cluster … and so on (see the Lloyd's-iteration sketch below)
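A minimal single-machine sketch of that EM loop (Lloyd's algorithm, numpy only, no Hadoop machinery): the E-step classifies each point by its nearest center, the M-step moves each center to the average of its points.

```python
import numpy as np

def kmeans(points, k, iterations=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # E-step: assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each center to the average of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

rng = np.random.default_rng(42)
pts = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
centers, labels = kmeans(pts, k=2)
print(np.round(centers, 1))  # two centers near (0, 0) and (5, 5)
```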
  20. K-means as Map-Reduce
     - Assignment of points to clusters is trivially parallel
     - Computation of the new clusters is also parallel
     - Moving clusters to the averages of their points is ideal for map-reduce
  21. But …
     - With map-reduce, iteration is evil
     - Starting a program can take 10-30 s
     - Saving data to disk and then immediately reading it back from disk is silly
     - The input might even fit in cluster memory
  22. Fix #1
     - Don't do that!
     - Use Spark: in-memory interactive map-reduce, 100x to 1000x faster, but the data must fit in memory
     - Use Giraph: a BSP programming model rather than map-reduce, essentially map-reduce-reduce-reduce…
     - Use GraphLab: like BSP without the speed brakes, 100x faster
  23. Fix #2
     - Use a sketch-based algorithm
     - Do one pass over the data to compute a sketch of the data
     - Cluster the sketch
     - Done, with good theoretical bounds on accuracy
     - Speedup of 3000x or more
  24. An Example
  25. The Problem
     - Spirals are a classic counterexample for k-means
     - A classic low-dimensional manifold with added noise
     - But clustering still makes modeling work well
  26. An Example
  27. An Example
  28. The Cluster Proximity Features
     - Every point can be described by its nearest cluster: 4.3 bits per point in this case, with significant error that can be decreased (to a point) by increasing the number of clusters
     - Or by its proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities): the error is negligible, and this unwinds the data into a simple representation
  29. Lots of Clusters Are Fine
  30. Surrogate Method
     - Start with a sloppy clustering into κ = k log n clusters
     - Use this sketch as a weighted surrogate for the data
     - Cluster the surrogate data using ball k-means
     - The results are provably good for highly clusterable data
     - The sloppy clustering is on-line, the surrogate can be kept in memory, and the ball k-means pass can be done at any time
  31. Algorithm Costs
     - O(k d log n) per point per iteration for Lloyd's algorithm
     - The number of iterations is not well known; more than log n iterations is a reasonable assumption
  32. Algorithm Costs
     - Surrogate methods:
       - fast, sloppy, single-pass clustering into κ = k log n clusters
       - fast, sloppy search for the nearest cluster: O(d log κ) = O(d (log k + log log n)) per point
       - fast, in-memory, high-quality clustering of the κ weighted centroids: O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k with high quality, or O(κ d log k) = O(d log κ log k) for larger k with looser quality
       - the result is k high-quality centroids
     - Even the sloppy clusters may suffice
  33. Algorithm Costs
     - How much faster is the sketch phase?
       - take k = 2000, d = 10, n = 100,000,000 (so log n ≈ 26)
       - k d log n = 2000 x 10 x 26 ≈ 500,000 per point
       - d (log k + log log n) = 10 x (11 + 5) = 170 per point
       - roughly 3,000 times faster is a bona fide big deal
  34. Pragmatics
     - But this requires a fast search internally
     - We have to cluster on the fly to build the sketch
     - We have to guarantee sketch quality
     - Previous methods had very high complexity
  35. How It Works
     - For each point:
       - find the approximately nearest centroid (at distance d)
       - if d > threshold, start a new centroid
       - else if u < d / threshold (u uniform random), start a new centroid anyway
       - else add the point to the nearest centroid
     - If the number of centroids exceeds κ ≈ C log N, recursively cluster the centroids with a higher threshold
     - The result is a large set of centroids: these approximate the original distribution, and we can cluster the centroids to get a close approximation of clustering the original data, or just use the result directly (a sketch follows below)
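A single-machine sketch of that loop. Two hedges: the random test is written here as u < d/threshold, so far-away points are more likely to spawn new centroids, and the recursive re-clustering step is simplified to just raising the threshold for later points.

```python
import random
import numpy as np

def sketch_cluster(points, threshold=1.0, kappa=200):
    centroids, weights = [], []
    for p in points:
        p = np.asarray(p, dtype=float)
        if not centroids:
            centroids.append(p.copy()); weights.append(1.0)
            continue
        dists = [float(np.linalg.norm(p - c)) for c in centroids]
        i = int(np.argmin(dists)); d = dists[i]
        if d > threshold or random.random() < d / threshold:
            centroids.append(p.copy()); weights.append(1.0)  # new centroid
        else:
            # fold the point into the nearest centroid (weighted average)
            centroids[i] = (centroids[i] * weights[i] + p) / (weights[i] + 1)
            weights[i] += 1
        if len(centroids) > kappa:
            # the real algorithm recursively re-clusters the centroids here;
            # this sketch just raises the bar for spawning new ones
            threshold *= 1.5
    return centroids, weights

data = np.random.default_rng(0).normal(size=(10000, 2))
cents, ws = sketch_cluster(data, threshold=0.5)
print(len(cents), int(sum(ws)))  # a few hundred centroids covering 10,000 points
```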
  36. Matrix Decomposition
     - Many big matrices can often be compressed
     - Often used in recommendations
     [Diagram: a large matrix expressed as the product of smaller factors]
  37. Nearest Neighbor
     - Very high-dimensional vectors can be compressed to 10-100 dimensions with little loss of accuracy
     - Fast search algorithms work up to dimension 50-100, but don't work above that
  38. Random Projections
     - Many problems in high dimension can be reduced to low dimension
     - Reductions with good distance approximation are available
     - Surprisingly, these methods can be implemented using random vectors
  39. Fundamental Trick
     - A random orthogonal projection Q preserves the action of A:
       ||Ax - Ay|| ≈ ||Q^T Ax - Q^T Ay||
     - (A numerical check follows below.)
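A quick numerical check of the trick, under the assumption that A is (nearly) low rank, with Q built from the range of A W as on the "But How?" slide: once Q captures the range of A, distances between images under A survive the projection essentially unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
# A hypothetical low-rank matrix so a narrow Q can capture its range.
A = rng.standard_normal((500, 40)) @ rng.standard_normal((40, 500))  # rank 40

W = rng.standard_normal((500, 60))  # random probe, a bit wider than the rank
Q, _ = np.linalg.qr(A @ W)          # orthonormal basis for the range of A

x = rng.standard_normal(500)
y = rng.standard_normal(500)
print(np.linalg.norm(A @ x - A @ y))          # full-dimension distance
print(np.linalg.norm(Q.T @ (A @ x - A @ y)))  # agrees to near machine precision
```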
  40. Projection Search
     [Figure: projecting points onto a random direction gives a total ordering along that direction]
  41. LSH Bit-match Versus Cosine
     [Figure: scatter plot of LSH bit-match count (0 to 64 bits) versus cosine similarity (-1 to 1)]
  42. But How?
     - Y = A W (multiply by a random matrix W)
     - Q1 R = Y (QR decomposition)
     - B = Q1^T A
     - L Q2 = B (LQ decomposition)
     - U S V^T = L (SVD of the small matrix L)
     - (Q1 U) S (Q2 V)^T ≈ A
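A numpy transliteration of these steps, as a single-machine sketch (the distributed version is not shown in the deck); A here is a hypothetical low-rank matrix, and the LQ step is done as a QR factorization of B transposed.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((300, 30)) @ rng.standard_normal((30, 200))  # rank 30

k = 40                                     # sketch width, a bit above the rank
W = rng.standard_normal((A.shape[1], k))   # random matrix W
Y = A @ W                                  # Y = A W
Q1, _ = np.linalg.qr(Y)                    # Q1 R = Y
B = Q1.T @ A                               # B = Q1^T A
Qb, R = np.linalg.qr(B.T)                  # L Q2 = B, with L = R^T, Q2 = Qb^T
L = R.T
U, S, Vt = np.linalg.svd(L, full_matrices=False)  # U S V^T = L
A_approx = (Q1 @ U) @ np.diag(S) @ (Vt @ Qb.T)    # (Q1 U) S (Q2 V)^T ≈ A
print(np.linalg.norm(A - A_approx) / np.linalg.norm(A))  # tiny relative error
```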
  43. Summary
     - Don't repeat big scans: cascade aggregations and compute several aggregates at once
     - Use approximate measures for rank statistics
     - Downsample where appropriate
     - Use non-traditional deployment
     - Use sketches
     - Use random projections
  44. Contact Me!
     - We're hiring at MapR in the US and Europe
     - Get the slides at http://www.mapr.com/company/events/cmu-hadoop-performance-11-1-12
     - Get the code at https://github.com/tdunning
     - Contact me at tdunning@maprtech.com or @ted_dunning
