Paris Data Geeks
A quick BOF talk given by Ted Dunning in Paris at Devoxx

Published in: Technology, Business

Transcript

  • 1. Practical Machine Learning with Mahout
  • 2. whoami – Ted Dunning • Chief Application Architect, MapR Technologies • Committer, member, Apache Software Foundation – particularly Mahout, Zookeeper and Drill (we’re hiring) • Contact me at tdunning@maprtech.com tdunning@apache.org ted.dunning@gmail.com @ted_dunning
  • 3. Agenda • What works at scale • Recommendation • Unsupervised - Clustering
  • 4. What Works at Scale • Logging • Counting • Session grouping
  • 5. What Works at Scale • Logging • Counting • Session grouping • Really. Don’t bet on anything much more complex than these
  • 6. What Works at Scale • Logging • Counting • Session grouping • Really. Don’t bet on anything much more complex than these • These are harder than they look
  • 7. Recommendations
  • 8. Recommendations • Special case of reflected intelligence • Traditionally “people who bought x also bought y” • But soooo much more is possible
  • 9. Examples • Customers buying books (Linden et al) • Web visitors rating music (Shardanand and Maes) or movies (Riedl, et al), (Netflix) • Internet radio listeners not skipping songs (Musicmatch) • Internet video watchers watching >30 s
  • 10. Dyadic Structure • Functional – Interaction: actor -> item* • Relational – Interaction ⊆ Actors x Items • Matrix – Rows indexed by actor, columns by item – Value is count of interactions • Predict missing observations
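The dyadic/matrix view above can be made concrete with a tiny Python sketch (the log and its contents are hypothetical, and a real system would use sparse vectors rather than dense lists):

```python
from collections import Counter

# Hypothetical interaction log: (actor, item) pairs, possibly repeated.
log = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "a"), ("u3", "b")]

# Matrix view of the dyadic structure: rows indexed by actor,
# columns by item, value = count of interactions.
counts = Counter(log)
users = sorted({u for u, _ in log})
items = sorted({i for _, i in log})
A = [[counts.get((u, i), 0) for i in items] for u in users]

print(A)  # [[1, 1], [2, 0], [0, 1]]
```

The zeros are the "missing observations" the slide refers to: u3 never interacted with item a, and prediction means estimating how strong that cell would be.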
  • 11. Recommendations Analysis • R(x, y) = # people who bought x also bought y
        select A.item_id as x, B.item_id as y, count(*)
        from (select distinct user_id, item_id from log) A
        join (select distinct user_id, item_id from log) B
          on A.user_id = B.user_id
        group by x, y
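The pair-counting that the SQL above expresses is just the matrix product AᵀA over a binary user × item matrix. A minimal pure-Python sketch of that equivalence (the matrix contents are hypothetical, and a real system would use sparse matrices):

```python
# Binary user x item matrix A (1 = user bought that item), hypothetical data.
# Columns are items x0, x1, x2.
A = [
    [1, 1, 0],  # user 0 bought x0 and x1
    [1, 1, 1],  # user 1 bought everything
    [0, 1, 1],  # user 2 bought x1 and x2
]

def transpose_times(M, N):
    """Compute M^T N for row-major matrices with the same number of rows."""
    rows, cols = len(M[0]), len(N[0])
    return [[sum(M[u][i] * N[u][j] for u in range(len(M)))
             for j in range(cols)] for i in range(rows)]

# R[i][j] = number of users who bought both item i and item j.
R = transpose_times(A, A)
print(R)  # [[2, 2, 1], [2, 3, 2], [1, 2, 2]]
```

The result is symmetric, as cooccurrence must be: the count of users who bought both i and j does not depend on the order of i and j.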
  • 17. Recommendations Analysis • R_ij = Σ_u A_ui B_uj, i.e. R = AᵀB
  • 18. Fundamental Algorithmic Structure • Cooccurrence: K = AᵀA • Matrix approximation by factoring: A ≈ USVᵀ, so K ≈ VS²Vᵀ and r = VS²Vᵀh • LLR: r = sparsify(AᵀA) h
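The factoring identity is easy to check numerically. A small numpy sketch (the toy random matrix is illustrative; in Mahout this would be a distributed decomposition over sparse data): since A = USVᵀ, the cooccurrence matrix K = AᵀA equals VS²Vᵀ, and a recommendation for a user-history vector h is r = Kh.

```python
import numpy as np

rng = np.random.default_rng(0)
A = (rng.random((6, 4)) < 0.4).astype(float)  # small binary user x item matrix

# Cooccurrence computed directly...
K = A.T @ A

# ...and via the SVD A = U S V^T, which gives K = V S^2 V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
K_svd = Vt.T @ np.diag(s**2) @ Vt

print(np.allclose(K, K_svd))  # True

# Recommendation scores for a history vector h: r = V S^2 V^T h.
h = np.array([1.0, 0.0, 0.0, 1.0])
r = K_svd @ h
```

With a truncated SVD (keeping only the largest singular values) K_svd becomes a low-rank approximation of K rather than an exact match, which is the usual trade-off at scale.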
  • 19. But Wait! • Cooccurrence: K = AᵀA • Cross occurrence: K = BᵀA
  • 20. For example • Users enter queries (A) – (actor = user, item=query) • Users view videos (B) – (actor = user, item=video) • A’A gives query recommendation – “did you mean to ask for” • B’B gives video recommendation – “you might like these videos”
  • 21. The punch-line • B’A recommends videos in response to a query – (isn’t that a search engine?) – (not quite, it doesn’t look at content or metadata)
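Cross-occurrence is the same computation with two different interaction matrices over the same users. A toy sketch (all data hypothetical): A records which queries each user issued, B which videos each user watched, and BᵀA counts how often a video and a query co-occur in one user's history.

```python
# Three users; two queries (columns of A) and two videos (columns of B).
A = [
    [1, 0],  # user 0 issued query q0
    [1, 1],  # user 1 issued q0 and q1
    [0, 1],  # user 2 issued q1
]
B = [
    [1, 0],  # user 0 watched video v0
    [1, 1],  # user 1 watched v0 and v1
    [0, 1],  # user 2 watched v1
]

def transpose_times(M, N):
    """M^T N for row-major matrices with the same number of rows."""
    return [[sum(M[u][i] * N[u][j] for u in range(len(M)))
             for j in range(len(N[0]))] for i in range(len(M[0]))]

# K[v][q] = number of users who watched video v AND issued query q,
# which is what lets B'A recommend videos in response to a query.
K = transpose_times(B, A)
print(K)  # [[2, 1], [1, 2]]
```

Row v of K, sparsified (e.g. by LLR), is the set of query terms that indicate video v — which is why this can be served from a search engine even though it was not built from content.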
  • 22. Real-life example • Query: “Paco de Lucia” • Conventional meta-data search results: – “hombres del paco” times 400 – not much else • Recommendation based search: – Flamenco guitar and dancers – Spanish and classical guitar – Van Halen doing a classical/flamenco riff
  • 23. Real-life example
  • 24. Hypothetical Example • Want a navigational ontology? • Just put labels on a web page with traffic – This gives A = users x label clicks • Remember viewing history – This gives B = users x items • Cross recommend – B’A = label to item mapping • After several users click, results are whatever users think they should be
  • 25. Super-fast k-means Clustering
  • 26. RATIONALE
  • 27. What is Quality? • Robust clustering not a goal – we don’t care if the same clustering is replicated • Generalization is critical • Agreement to “gold standard” is a non-issue
  • 28. An Example
  • 29. An Example
  • 30. Diagonalized Cluster Proximity
  • 31. Clusters as Distribution Surrogate
  • 32. Clusters as Distribution Surrogate
  • 33. THEORY
  • 34. For Example • Grouping these two clusters seriously hurts squared distance: D₄²(X) > (1/σ²) D₅²(X)
  • 35. ALGORITHMS
  • 36. Typical k-means Failure • Selecting two seeds here cannot be fixed with Lloyd’s algorithm • Result is that these two clusters get glued together
  • 37. Ball k-means • Provably better for highly clusterable data • Tries to find initial centroids in the “core” of each real cluster • Avoids outliers in centroid computation
        initialize centroids randomly with distance-maximizing tendency
        for each of a very few iterations:
            for each data point:
                assign point to nearest cluster
            recompute centroids using only points much closer to their own centroid than to the closest other centroid
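A compact Python sketch of the idea (the blob data, the farthest-first initialization, and the ball fraction of 0.5 are illustrative choices, not Mahout's actual implementation):

```python
import math
import random

def dist(p, q):
    return math.dist(p, q)

def farthest_first_init(points, k, rng):
    """Distance-maximizing initialization: start anywhere, then repeatedly
    take the point farthest from all centroids chosen so far."""
    cents = [rng.choice(points)]
    while len(cents) < k:
        cents.append(max(points, key=lambda p: min(dist(p, c) for c in cents)))
    return cents

def ball_kmeans(points, k, iters=3, ball=0.5, seed=0):
    rng = random.Random(seed)
    centroids = farthest_first_init(points, k, rng)
    for _ in range(iters):
        cores = [[] for _ in range(k)]
        for p in points:
            d = [dist(p, c) for c in centroids]
            i = min(range(k), key=d.__getitem__)
            # Only "core" points vote: those much closer to their own
            # centroid than to the second-nearest one. Outliers are ignored.
            second = min(d[j] for j in range(k) if j != i)
            if d[i] <= ball * second:
                cores[i].append(p)
        for i, pts in enumerate(cores):
            if pts:
                centroids[i] = tuple(sum(xs) / len(pts) for xs in zip(*pts))
    return centroids

# Two well-separated blobs; ball k-means recovers both centers.
blob_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
blob_b = [(10, 10), (10, 11), (11, 10), (11, 11)]
centers = ball_kmeans(blob_a + blob_b, k=2)
print(sorted(centers))  # [(0.5, 0.5), (10.5, 10.5)]
```

The ball condition is what gives the robustness claimed on the slide: points near the boundary between clusters, and outliers far from everything, simply don't get averaged in.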
  • 39. Still Not a Win • Ball k-means is nearly guaranteed with k = 2 • Probability of successful seeding drops exponentially with k • Alternative strategy has high probability of success, but takes O( nkd + k3d ) time • But for big data, k gets large
  • 40. Surrogate Method • Start with sloppy clustering into lots of clusters κ = k log n clusters • Use this sketch as a weighted surrogate for the data • Results are provably good for highly clusterable data
  • 42. Algorithm Costs • Surrogate methods – fast, sloppy single-pass clustering with κ = k log n – fast, sloppy search for nearest cluster: O(d log κ) = O(d (log k + log log n)) per point – fast, in-memory, high-quality clustering of the κ weighted centroids: O(κ k d + k³ d) = O(k² d log n + k³ d) for small k, high quality; O(κ d log k) or O(d log k (log k + log log n)) for larger k, looser quality – result is k high-quality centroids • For many purposes, even the sloppy surrogate may suffice
  • 43. Algorithm Costs • How much faster for the sketch phase? – take k = 2000, d = 10, n = 100,000 – k d log n = 2000 × 10 × 26 ≈ 500,000 – d (log k + log log n) = 10 × (11 + 5) = 160 – roughly 3,000 times faster is a bona fide big deal
  • 45. How It Works • For each point – find the approximately nearest centroid (distance = d) – if d > threshold, start a new centroid – else, with u a uniform random draw, compare u against d/threshold so that new centroids appear with probability proportional to distance – otherwise add the point to the nearest centroid • If the number of centroids exceeds κ ≈ C log N – recursively cluster the centroids with a higher threshold
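The loop above can be sketched in Python. This is a simplified reading, not Mahout's code: u is taken as a uniform random draw so that new centroids appear with probability proportional to d/threshold, and the recursive re-clustering step is replaced by simply raising the threshold.

```python
import math
import random

def streaming_sketch(points, threshold, kappa, seed=0):
    """One-pass clustering sketch: each point either folds into its nearest
    centroid (a weighted running mean) or starts a new centroid."""
    rng = random.Random(seed)
    centroids = []  # list of (centroid, weight) pairs
    for p in points:
        if not centroids:
            centroids.append((p, 1))
            continue
        i = min(range(len(centroids)),
                key=lambda j: math.dist(p, centroids[j][0]))
        d = math.dist(p, centroids[i][0])
        if d > threshold or rng.random() < d / threshold:
            centroids.append((p, 1))   # start a new centroid
        else:
            c, w = centroids[i]        # fold point into nearest centroid
            centroids[i] = (tuple((ci * w + pi) / (w + 1)
                                  for ci, pi in zip(c, p)), w + 1)
        if len(centroids) > kappa:
            # The full algorithm recursively clusters the centroids with a
            # higher threshold; raising it here is the simplest stand-in.
            threshold *= 1.5
    return centroids

points = [(0, 0), (0.1, 0), (0, 0.1), (10, 10), (10.1, 10), (10, 10.1)]
sketch = streaming_sketch(points, threshold=5.0, kappa=10)
print(len(sketch), sum(w for _, w in sketch))
```

Because every point contributes weight 1 somewhere, the sketch's total weight always equals the number of input points, which is what makes it usable as a weighted surrogate for the data.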
  • 46. IMPLEMENTATION
  • 47. But Wait, … • Finding nearest centroid is inner loop • This could take O( d κ ) per point and κ can be big • Happily, approximate nearest centroid works fine
  • 48. Projection Search total ordering!
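Projection search in a nutshell: project every point onto one random direction, sort by the scalar projection (that is the "total ordering" of the slide), and answer nearest-neighbor queries with binary search plus a small scan. A sketch under those assumptions (the function names and the probe width are illustrative):

```python
import bisect
import math
import random

def build_projection_index(points, seed=0):
    """Project points onto a random unit direction; the projected scalars
    give a total ordering that bisect can search."""
    rng = random.Random(seed)
    dims = len(points[0])
    u = [rng.gauss(0, 1) for _ in range(dims)]
    norm = math.sqrt(sum(x * x for x in u))
    u = [x / norm for x in u]
    keyed = sorted((sum(pi * ui for pi, ui in zip(p, u)), p) for p in points)
    return u, keyed

def approx_nearest(q, index, probe=4):
    """Binary-search q's projection, then scan a few neighbors each side."""
    u, keyed = index
    key = sum(qi * ui for qi, ui in zip(q, u))
    pos = bisect.bisect(keyed, (key,))
    lo, hi = max(0, pos - probe), min(len(keyed), pos + probe)
    return min((p for _, p in keyed[lo:hi]), key=lambda p: math.dist(q, p))

points = [(float(i), float(i)) for i in range(100)]
index = build_projection_index(points)
print(approx_nearest((42.2, 41.9), index))  # (42.0, 42.0)
```

One projection can be fooled by points that are far apart but project close together; using several independent projections and taking the best candidate over all of them is the usual remedy, at a cost of O(d log κ) per query per projection.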
  • 49. LSH Bit-match Versus Cosine – scatter plot of 64-bit LSH bit-match count (0–64) against cosine similarity (−1 to 1)
  • 50. RESULTS
  • 51. Parallel Speedup? – plot of time per point (μs) against thread count (1–16) for the threaded and non-threaded versions; the threaded version tracks perfect scaling
  • 52. Quality • Ball k-means implementation appears significantly better than simple k-means • Streaming k-means + ball k-means appears to be about as good as ball k-means alone • All evaluations on 20 newsgroups with held-out data • Figure of merit is mean and median squared distance to nearest cluster
  • 53. Contact Me! • We’re hiring at MapR in US and Europe • MapR software available for research use • Get the code as part of Mahout trunk (or 0.8 very soon) • Contact me at tdunning@maprtech.com or @ted_dunning • Share news with @apachemahout
