# CMU Lecture on Hadoop Performance

A lecture describing several ways to make Hadoop programs go faster.


Published in: Technology
### Transcript

• 1. Hadoop Performance
  ©MapR Technologies - Confidential
• 2. Agenda
  What is performance?
  Optimization?
  Case 1: Aggregation
  Case 2: Recommendations
  Case 3: Clustering
  Case 4: Matrix decomposition
• 3. What is Performance?
  Is doing something faster better?
  Is it the right task?
  Do you have a wide enough view?
  What is the right performance metric?
• 4. Aggregation
  Word-count and friends
    – How many times did X occur?
    – How many unique X’s occurred?
  Associative metrics permit decomposition
    – Partial sums and grand totals for example
    – Use combiners
    – Use high resolution aggregates to compute low resolution aggregates
  Rank-based statistics do not permit decomposition
    – Avoid them
    – Use approximations
• 5. Inside Map-Reduce
  [Diagram: word count over the lines "The time has come," the Walrus said, / "To talk of many things: / Of shoes—and ships—and sealing-wax. Stages: Input, Map, Combine, Shuffle and sort, Reduce, Output. The map stage emits pairs such as (the, 1), (time, 1), (has, 1), (come, 1); later stages carry lists of partial counts per word and the output holds one total per word.]
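
The stages named on this slide can be sketched in plain Python. This is only an illustration of the data flow, not Hadoop code: the function names (`map_words`, `combine`, `shuffle`, `reduce_counts`) and the two-split layout are assumptions made for the example, and the tokenizer is a simple regular expression.

```python
import re
from collections import Counter, defaultdict


def map_words(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in re.findall(r"[a-z]+", line.lower())]


def combine(pairs):
    """Combine: pre-sum counts on the map side, before the shuffle."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())


def shuffle(per_split):
    """Shuffle and sort: group the partial counts by word."""
    grouped = defaultdict(list)
    for pairs in per_split:
        for word, count in pairs:
            grouped[word].append(count)
    return dict(sorted(grouped.items()))


def reduce_counts(grouped):
    """Reduce: add the partial counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Two hypothetical input splits, each handled by its own mapper.
splits = [
    ['"The time has come," the Walrus said,'],
    ['"To talk of many things:', "Of shoes and ships and sealing-wax"],
]
combined = [
    combine(pair for line in split for pair in map_words(line))
    for split in splits
]
totals = reduce_counts(shuffle(combined))
```

Because addition is associative, the combiner can collapse the two `("the", 1)` pairs in the first split into a single `("the", 2)` before anything crosses the network, which is the point of the "use combiners" advice.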
• 6. Don’t Do This
  [Diagram of aggregation jobs with nodes: Daily, Raw, Weekly, Monthly]
• 7. Do This Instead
  [Diagram of aggregation jobs with nodes: Daily, Weekly, Raw, Monthly]
• 8. Aggregation First rule: – Don’t read the big input multiple times – Compute longer term aggregates from short term aggregates Second rule: – Don’t read the big input multiple times – Compute multiple windowed aggregates at the same time©MapR Technologies - Confidential 8
• 9. Rank Statistics Can Be Tamed
  Approximate quartiles are easily computed
    – (but sorted data is evil)
  Approximate unique counts are easily computed
    – use Bloom filter and extrapolate from number of set bits
    – use multiple filters at different down-sample rates
  Approximate high or low quantiles are easily computed
    – keep largest 1000 elements
    – keep largest 1000 elements from 10x down-sampled data
    – and so on
  Approximate top-40 also possible
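
The "extrapolate from number of set bits" idea can be sketched with a single-hash bitmap, a simplification of a Bloom filter often called linear counting. With m bits of which X are set, the estimate is n ≈ -m ln(1 - X/m). The function name `estimate_unique`, the choice of SHA-256 as the hash, and the default of 65,536 bits are assumptions for this example.

```python
import hashlib
import math


def estimate_unique(items, num_bits=1 << 16):
    """Estimate the number of distinct items from the count of set bits."""
    bits = bytearray(num_bits)  # one byte per bit, for clarity over compactness
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % num_bits
        bits[index] = 1
    set_bits = sum(bits)
    if set_bits == num_bits:
        raise ValueError("bitmap is saturated; use more bits")
    # Invert the expected fill rate of a bitmap that received n distinct keys.
    return -num_bits * math.log(1.0 - set_bits / num_bits)


# 50,000 events drawn from 10,000 distinct values.
stream = [i % 10_000 for i in range(50_000)]
estimate = estimate_unique(stream)
```

Bitmaps built on separate partitions of the input can be OR-ed together before estimating, so the approach decomposes the way word counts do. Once the bitmap nears saturation the estimate degrades, which is where the slide's suggestion of additional filters on down-sampled data comes in.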
• 10. Recommendations
  Common patterns in the past may predict common patterns in the future
  People who bought item x also bought item y
  But also, people who bought Chinese food in the past, …
  Or people in SoMa really liked this restaurant in the past
• 11. People who bought …
  Key operation is counting number of people who bought x and y
    – for all x’s and all y’s
  The raw problem appears to be O(N^3)
  At the least, O(k_max^2)
    – for most prolific user, there are k^2 pairs to count
    – k_max can be near N
  Scalable problems must be O(N)
• 12. But …
  What do we learn from users who buy everything
    – they have no discrimination
    – they are often the QA team
    – they tell us nothing
  What do we learn from items bought by everybody
    – the dual of omnivorous buyers
    – these are often teaser items
    – they tell us nothing
• 13. Also …
  What would you learn about a user from purchases
    – 1 … 20?
    – 21 … 100?
    – 101 … 1000?
    – 1001 … ∞?
  What about learning about an item?
    – how many people do we need to see before we understand the item?
• 14. So … Cheat!
  Downsample every user to at most 1000 interactions
    – most recent
    – most rare
    – random selection
    – whatever is easiest
  Now k_max ≤ 1000
• 15. The Fundamental Things Apply
  Don’t read the raw data repeatedly
  Sessionize and denormalize per hour/day/week
    – that is, group by user
    – expand items with categories and content descriptors if feasible
  Feed all down-stream processing in one pass
    – baby join to item characteristics
    – downsample
    – count grand totals
    – compute cooccurrences
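
The group-by-user, downsample, and cooccurrence steps can be sketched as follows. This is an in-memory illustration only: the function name `cooccurrences`, the `(user, item)` input layout, and the use of random selection as the downsampling policy are assumptions, and a real job would run the same logic per user inside a reducer.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrences(interactions, k_max=1000, seed=0):
    """Count item totals and item-pair cooccurrences with capped user histories."""
    rng = random.Random(seed)
    by_user = defaultdict(set)
    for user, item in interactions:  # group by user
        by_user[user].add(item)

    item_totals = Counter()
    pair_counts = Counter()
    for user in sorted(by_user):
        items = sorted(by_user[user])
        if len(items) > k_max:
            # Random selection; most recent or most rare would also fit the slides.
            items = sorted(rng.sample(items, k_max))
        item_totals.update(items)
        pair_counts.update(combinations(items, 2))  # at most k_max^2 / 2 pairs
    return item_totals, pair_counts


interactions = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"), ("u2", "c"),
] + [("qa", f"item{i}") for i in range(50)]  # an "omnivorous" user

totals, pairs = cooccurrences(interactions, k_max=3)
```

With `k_max=3` the 50-item user contributes only 3 pairs instead of 1,225, which is how the cap keeps the per-user work bounded.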
• 16. Deployment Matters, Too
  For restaurant case, basic recommendation info includes:
    – user x merchant histories
    – user x cuisine histories
    – top local restaurant by anomalous repeat visits
    – restaurant x indicator merchant cooccurrence matrix
    – restaurant x indicator cuisine cooccurrence matrix
  These can all be stored and accessed using text retrieval techniques
  Fast deployment using mirrors and NFS (not standard Hadoop)
• 17. Non-Traditional Deployment Demo
  DEMO
• 18. EM Algorithms
  Start with random model estimates
  Use model estimates to classify examples
  Use classified examples to find probability maximum estimates
  Use model estimates to classify examples
  Use classified examples to find probability maximum estimates
  … And so on …
• 19. K-means as EM Algorithm
  Assign a random seed to each cluster
  Assign points to nearest cluster
  Move cluster to average of contained points
  Assign points to nearest cluster
  … and so on …
• 20. K-means as Map-Reduce
  Assignment of points to cluster is trivially parallel
  Computation of new clusters is also parallel
  Moving points to averages is ideal for map-reduce
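
One k-means iteration phrased as map and reduce steps might look like the sketch below. The names `assign`, `merge`, and `kmeans_iteration` are hypothetical, and everything runs in one process here, whereas a Hadoop job would run `assign` in mappers and `merge` in combiners and reducers.

```python
import math


def assign(point, centroids):
    """Map: emit (index of nearest centroid, (point, 1))."""
    nearest = min(
        range(len(centroids)),
        key=lambda i: math.dist(point, centroids[i]),
    )
    return nearest, (point, 1)


def merge(a, b):
    """Combine/reduce: add two (vector sum, count) partial results."""
    (sum_a, n_a), (sum_b, n_b) = a, b
    return [x + y for x, y in zip(sum_a, sum_b)], n_a + n_b


def kmeans_iteration(points, centroids):
    partials = {}
    for point in points:
        key, value = assign(point, centroids)
        partials[key] = merge(partials[key], value) if key in partials else value
    # Clusters that received no points keep their previous position.
    updated = [list(c) for c in centroids]
    for key, (total, n) in partials.items():
        updated[key] = [x / n for x in total]
    return updated


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids = [(1, 1), (9, 9)]
updated = kmeans_iteration(points, centroids)
```

Since `(sum, count)` pairs add associatively, the averaging step decomposes just like the word-count totals, and only one small record per cluster per mapper needs to be shuffled.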
• 21. But …
  With map-reduce, iteration is evil
  Starting a program can take 10-30s
  Saving data to disk and then immediately reading from disk is silly
  Input might even fit in cluster memory
• 22. Fix #1
  Don’t do that!
  Use Spark
    – in memory interactive map-reduce
    – 100x to 1000x faster
    – must fit in memory
  Use Giraph
    – BSP programming model rather than map-reduce
    – essentially map-reduce-reduce-reduce…
  Use GraphLab
    – Like BSP without the speed brakes
    – 100x faster
• 23. Fix #2
  Use a sketch-based algorithm
  Do one pass over the data to compute sketch of the data
  Cluster the sketch
  Done. With good theoretic bounds on accuracy
  Speedup of 3000x or more
• 24. An Example
• 25. The Problem
  Spirals are a classic “counter” example for k-means
  Classic low dimensional manifold with added noise
  But clustering still makes modeling work well
• 26. An Example
• 27. An Example
• 28. The Cluster Proximity Features
  Every point can be described by the nearest cluster
    – 4.3 bits per point in this case
    – Significant error that can be decreased (to a point) by increasing number of clusters
  Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)
    – Error is negligible
    – Unwinds the data into a simple representation
• 29. Lots of Clusters Are Fine
• 30. Surrogate Method
  Start with sloppy clustering into κ = k log n clusters
  Use this sketch as a weighted surrogate for the data
  Cluster surrogate data using ball k-means
  Results are provably good for highly clusterable data
  Sloppy clustering is on-line
  Surrogate can be kept in memory
  Ball k-means pass can be done at any time
• 31. Algorithm Costs
  O(k d log n) per point per iteration for Lloyd’s algorithm
  Number of iterations not well known
  Iteration > log n reasonable assumption
• 32. Algorithm Costs Surrogate methods – fast, sloppy single pass clustering with κ = k log n – fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point – fast, in-memory, high-quality clustering of κ weighted centroids O(κ k d + k3 d) = O(k2 d log n + k3 d) for small k, high quality O(κ d log k) or O(d log κ log k) for larger k, looser quality – result is k high-quality centroids • Even the sloppy clusters may suffice©MapR Technologies - Confidential 32
• 33. Algorithm Costs How much faster for the sketch phase? – take k = 2000, d = 10, n = 100,000 – k d log n = 2000 x 10 x 26 = 500,000 – log k + log log n = 11 + 5 = 17 – 30,000 times faster is a bona fide big deal©MapR Technologies - Confidential 33
• 34. Pragmatics
  But this requires a fast search internally
  Have to cluster on the fly for sketch
  Have to guarantee sketch quality
  Previous methods had very high complexity
• 35. How It Works
  For each point
    – Find approximately nearest centroid (distance = d)
    – If (d > threshold) new centroid
    – Else if (u > d/threshold) new cluster
    – Else add to nearest centroid
  If centroids > κ ≈ C log N
    – Recursively cluster centroids with higher threshold
  Result is large set of centroids
    – these provide approximation of original distribution
    – we can cluster centroids to get a close approximation of clustering original
    – or we can just use the result directly
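
A small single-pass sketch along these lines is shown below, under several stated assumptions: nearest-centroid search is brute force rather than the fast approximate search the talk calls for, a nearby point starts a new centroid with probability distance/threshold (one common reading of the randomized rule in streaming k-means, with u drawn uniformly from [0, 1)), and the threshold grows by a factor of 1.5 on each collapse. The names `nearest`, `absorb`, and `sketch` and all default parameter values are hypothetical.

```python
import math
import random


def nearest(centroids, point):
    """Brute-force nearest centroid; returns (index, distance)."""
    best_i, best_d = -1, math.inf
    for i, (mean, _) in enumerate(centroids):
        dist = math.dist(mean, point)
        if dist < best_d:
            best_i, best_d = i, dist
    return best_i, best_d


def absorb(centroids, point, weight, threshold, rng):
    """Add one weighted point to the sketch using the threshold rule."""
    if not centroids:
        centroids.append((list(point), weight))
        return
    i, dist = nearest(centroids, point)
    # Far points always start a centroid; nearer ones do so with
    # probability dist / threshold (an assumed reading of the rule).
    if dist > threshold or rng.random() < dist / threshold:
        centroids.append((list(point), weight))
    else:
        mean, w = centroids[i]
        total = w + weight
        merged = [(m * w + x * weight) / total for m, x in zip(mean, point)]
        centroids[i] = (merged, total)


def sketch(points, max_centroids, threshold=0.1, growth=1.5, seed=0):
    """One pass over the points; collapse the sketch whenever it grows too big."""
    rng = random.Random(seed)
    centroids = []
    for point in points:
        absorb(centroids, point, 1.0, threshold, rng)
        while len(centroids) > max_centroids:
            threshold *= growth  # re-cluster the centroids more coarsely
            old, centroids = centroids, []
            for mean, w in old:
                absorb(centroids, mean, w, threshold, rng)
    return centroids


data_rng = random.Random(42)
points = [
    (data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
    for cx, cy in [(0, 0), (5, 5)]
    for _ in range(200)
]
centroids = sketch(points, max_centroids=20)
```

The returned list of `(mean, weight)` pairs is the weighted surrogate: the weights always sum to the number of input points, and the list is small enough to hand to an in-memory clustering pass.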
• 36. Matrix Decomposition
  Many big matrices can often be compressed
  [Diagram: a large matrix equated to a product of smaller factors]
  Often used in recommendations
• 37. Nearest Neighbor
  Very high dimensional vectors can be compressed to 10-100 dimensions with little loss of accuracy
  Fast search algorithms work up to dimension 50-100, don’t work above that
• 38. Random Projections
  Many problems in high dimension can be reduced to low dimension
  Reductions with good distance approximation are available
  Surprisingly, these methods can be done using random vectors
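
The distance-preserving property can be checked with a small experiment. This sketch uses a plain Gaussian random matrix scaled by 1/sqrt(dim_out), which is a swapped-in simplification and not the orthogonalized projection discussed elsewhere in the talk. The dimensions (500 down to 256), the seeds, and the helper names `random_projection` and `project` are all assumptions.

```python
import math
import random


def random_projection(dim_in, dim_out, seed=0):
    """Rows are independent Gaussian vectors, scaled to preserve expected length."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
        for _ in range(dim_out)
    ]


def project(matrix, vector):
    """Multiply the projection matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


data_rng = random.Random(1)
x = [data_rng.gauss(0.0, 1.0) for _ in range(500)]
y = [data_rng.gauss(0.0, 1.0) for _ in range(500)]

R = random_projection(500, 256)
original = math.dist(x, y)
projected = math.dist(project(R, x), project(R, y))
ratio = projected / original  # expected to be close to 1
```

The relative error shrinks roughly as 1/sqrt(dim_out), so the target dimension is chosen from the accuracy required rather than from the input dimension.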
• 39. Fundamental Trick
  Random orthogonal projection preserves action of A
  $\|Ax - Ay\| \approx \|Q^T A x - Q^T A y\|$
• 40. Projection Search
  total ordering!
• 41. LSH Bit-match Versus Cosine
  [Plot: X axis from 0 to 64, Y axis from -1 to 1]
• 42. But How?
  $Y = A\Omega$
  $Q_1 R = Y$
  $B = Q_1^T A$
  $L Q_2 = B$
  $U S V^T = L$
  $(Q_1 U)\, S\, (Q_2^T V)^T \approx A$
• 43. Summary
  Don’t repeat big scans
    – Cascade aggregations
    – Compute several aggregates at once
  Use approximate measures for rank statistics
  Downsample where appropriate
  Use non-traditional deployment
  Use sketches
  Use random projections
• 44. Contact Me!
  We’re hiring at MapR in US and Europe
  Come get the slides at http://www.mapr.com/company/events/cmu-hadoop-performance-11-1-12
  Get the code at https://github.com/tdunning
  Contact me at tdunning@maprtech.com or @ted_dunning