News from Mahout

©MapR Technologies - Confidential

whoami – Ted Dunning

 Chief Application Architect, MapR Technologies
 Committer, member, Apache Software Foundation
– particularly Mahout, Zookeeper and Drill

(we’re hiring)

 Contact me at
tdunning@maprtech.com
tdunning@apache.org
ted.dunning@gmail.com
@ted_dunning


 Slides and such (available late tonight):
– http://www.mapr.com/company/events/nyhug-03-05-2013

 Hash tags: #mapr #nyhug #mahout


New in Mahout

 0.8 is coming soon (1-2 months)
 gobs of fixes
 QR decomposition is 10x faster
– makes ALS 2-3 times faster
 May include Bayesian Bandits
 Super fast k-means
– fast
– online (!?!)
 Possible new edition of Mahout in Action (MiA) coming
– Japanese and Korean editions released, Chinese coming


Real-time Learning


We have a product to sell … from a web-site


What picture?
What tag-line?
What call to action?

Bogus Dog Food is the Best!
Now available in handy 1 ton bags!
Buy 5!


The Challenge

 Design decisions affect probability of success
– Cheesy web-sites don’t even sell cheese

 The best designers do better when allowed to fail
– Exploration juices creativity

 But failing is expensive
– If only because we could have succeeded
– But also because offending or disappointing customers is bad


More Challenges

 Too many designs
– 5 pictures
– 10 tag-lines
– 4 calls to action
– 3 back-ground colors
=> 5 x 10 x 4 x 3 = 600 designs

 It gets worse quickly
– What about changes on the back-end?
– Search engine variants?
– Checkout process variants?
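The design count above is simply the product of the option counts. A minimal sketch that enumerates the combinations; the option names are hypothetical placeholders, not real page assets:

```python
from itertools import product

# Hypothetical placeholders for the design options listed above.
pictures = [f"picture_{i}" for i in range(5)]
tag_lines = [f"tag_line_{i}" for i in range(10)]
calls_to_action = [f"call_to_action_{i}" for i in range(4)]
background_colors = [f"color_{i}" for i in range(3)]

# Every design is one choice from each list.
designs = list(product(pictures, tag_lines, calls_to_action, background_colors))
print(len(designs))  # 600
```

Each extra dimension (search engine variant, checkout variant) multiplies this count again, which is why testing every design with equal traffic quickly becomes impractical.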


Example – AB testing in real-time

 I have 15 versions of my landing page
 Each visitor is assigned to a version
– Which version?
 A conversion or sale or whatever can happen
– How long to wait?
 Some versions of the landing page are horrible
– Don’t want to give them traffic


A Quick Diversion

 You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
 I flip the coin and while it is in the air ask again
 I catch the coin and ask again
 I look at the coin (and you don’t) and ask again
 Why does the answer change?
– And did it ever have a single value?


A Philosophical Conclusion

 Probability as expressed by humans is subjective and depends on
information and experience


I Dunno


5 heads out of 10 throws


2 heads out of 12 throws


So now you understand
Bayesian probability
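The three cases above (no data, 5 heads out of 10, 2 heads out of 12) can be read as distributions over the probability of heads. A minimal sketch, assuming a uniform prior so that the posterior is Beta(heads + 1, tails + 1); the function name and the 90% interval are illustrative choices, not something taken from the slides:

```python
import random

def heads_posterior(heads, throws, draws=20000, seed=1):
    """Summarize belief about P(heads), assuming a uniform prior.

    With a uniform prior the posterior is Beta(heads + 1, tails + 1).
    Returns (posterior mean, 5th percentile, 95th percentile), with the
    percentiles estimated by sampling.
    """
    tails = throws - heads
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(heads + 1, tails + 1) for _ in range(draws))
    mean = (heads + 1) / (throws + 2)
    return mean, samples[int(0.05 * draws)], samples[int(0.95 * draws)]

for heads, throws in [(0, 0), (5, 10), (2, 12)]:
    mean, low, high = heads_posterior(heads, throws)
    print(f"{heads} heads in {throws} throws: "
          f"mean {mean:.2f}, 90% interval {low:.2f} to {high:.2f}")
```

With no data the interval covers nearly the whole range ("I dunno"); each additional throw narrows it and shifts it toward the observed frequency.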


Another Quick Diversion

 Let’s play a shell game
 This is a special shell game
 It costs you nothing to play
 The pea has constant probability of being under each shell
(trust me)
 How do you find the best shell?
 How do you find it while maximizing the number of wins?


Pause for short
con-game


Interim Thoughts

 Can you identify winners or losers without trying them out?

 Can you ever completely eliminate a shell with a bad streak?

 Should you keep trying apparent losers?


So now you understand
multi-armed bandits


Conclusions

 Can you identify winners or losers without trying them out?
No

 Can you ever completely eliminate a shell with a bad streak?
No

 Should you keep trying apparent losers?
Yes, but at a decreasing rate


Is there an optimum
strategy?


Bayesian Bandit

 Compute distributions based on data so far
 Sample p1, p2 and p3 from these distributions
 Pick shell i where i = argmax_i p_i

 Lemma 1: The probability of picking shell i will match the
probability it is the best shell

 Lemma 2: This is as good as it gets
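The steps above can be sketched as a small simulation. This assumes win/lose payoffs with a uniform prior, so each shell's distribution is Beta(wins + 1, losses + 1); the true probabilities are made-up values used only to drive the simulation. Because a shell is picked exactly when its sample is the largest, the chance of picking it equals the current probability that it is the best shell.

```python
import random

def select(counts, rng):
    """Sample a win probability for each shell and pick the largest.

    counts[i] = [losses, wins] for shell i.
    """
    samples = [rng.betavariate(wins + 1, losses + 1) for losses, wins in counts]
    return samples.index(max(samples))

def play(true_probabilities, steps, seed=0):
    """Run the bandit against simulated shells and return the counts."""
    rng = random.Random(seed)
    counts = [[0, 0] for _ in true_probabilities]
    for _ in range(steps):
        i = select(counts, rng)
        win = rng.random() < true_probabilities[i]
        counts[i][1 if win else 0] += 1
    return counts

# Hypothetical shells: the third one hides the pea most often.
counts = play([0.2, 0.3, 0.6], steps=2000)
pulls = [losses + wins for losses, wins in counts]
print(pulls)
```

The apparent losers are still tried now and then, but at a decreasing rate, because their sampled values rarely beat the leader once enough data has accumulated.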


And it works!

[Figure: regret (0 to 0.12) versus number of trials n (0 to 1100), comparing ε-greedy with ε = 0.05 against the Bayesian Bandit with Gamma-Normal]


Video Demo


The Code

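 Select an alternative
select = function(k) {
  # k is an n x 2 matrix of counts, one row per alternative; the Beta
  # parameters below treat column 2 as wins and column 1 as losses
  n = dim(k)[1]
  # draw one sample per alternative from its Beta posterior
  p0 = rbeta(n, k[,2]+1, k[,1]+1)
  # pick the alternative with the largest sample
  return (which.max(p0))
}

 Select and learn
# the wrapper name "learn" is illustrative
learn = function(k, steps) {
  for (z in 1:steps) {
    i = select(k)
    # test(i) tries alternative i and returns the column to increment
    j = test(i)
    k[i,j] = k[i,j]+1
  }
  return (k)
}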

 But we already know how to count!


The Basic Idea

 We can encode a distribution by sampling
 Sampling allows unification of exploration and exploitation

 Can be extended to more general response models


The Original Problem

[Figure: the Bogus Dog Food page with its design elements labeled x1, x2 and x3]


Response Function

$$p(\mathrm{win}) = w\left(\sum_i \theta_i x_i\right)$$

[Figure: plot of the response function w, with y ranging from 0 to 1 and x from -6 to 6]
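A minimal sketch of such a response function, assuming that w is the logistic function (the slide does not name it, but the plotted range of 0 to 1 is consistent with that choice). The weights are hypothetical values for illustration:

```python
import math

def logistic(t):
    """Logistic link: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def p_win(theta, x):
    """p(win) = w(sum_i theta_i * x_i), with w assumed to be logistic."""
    return logistic(sum(t * xi for t, xi in zip(theta, x)))

# Hypothetical weights for three design variables x1, x2, x3.
theta = [0.8, -0.4, 1.5]
print(p_win(theta, [1, 0, 1]))
print(p_win(theta, [0, 0, 0]))  # 0.5
```

Each design variable pushes the win probability up or down through its weight, so learning about one design also says something about every other design that shares a variable with it.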


Generalized Banditry

 Suppose we have an infinite number of bandits
– suppose they are each labeled by two real numbers x and y in [0,1]
– also that expected payoff is a parameterized function of x and y

E[z] = f(x, y | θ)
– now assume a distribution for θ that we can learn online
 Selection works by sampling θ, then computing f
 Learning works by propagating updates back to θ
– If f is linear, this is very easy
– For certain other kinds of f it isn’t too hard
 We are not limited to two labels; we could have labels and context
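For the linear case, selection and learning can be sketched as follows. This assumes f(x, y | θ) = θ1·x + θ2·y with Gaussian noise of known variance and a zero-mean Gaussian prior on θ, which makes the posterior Gaussian and the update a matter of accumulating sums. The class name, the candidate grid, and the true θ used in the simulation are all hypothetical.

```python
import math
import random

class LinearBandit:
    """Gaussian posterior over theta = (t1, t2) for z = t1*x + t2*y + noise."""

    def __init__(self, prior_precision=1.0, noise_variance=0.01):
        # Posterior precision matrix [[a, b], [b, c]] and vector (q1, q2).
        self.a, self.b, self.c = prior_precision, 0.0, prior_precision
        self.q1, self.q2 = 0.0, 0.0
        self.noise_variance = noise_variance

    def posterior(self):
        """Return the posterior mean and covariance entries (s11, s12, s22)."""
        det = self.a * self.c - self.b * self.b
        s11, s12, s22 = self.c / det, -self.b / det, self.a / det
        mean = (s11 * self.q1 + s12 * self.q2, s12 * self.q1 + s22 * self.q2)
        return mean, (s11, s12, s22)

    def sample_theta(self, rng):
        """Draw theta from the posterior using a 2x2 Cholesky factor."""
        (m1, m2), (s11, s12, s22) = self.posterior()
        l11 = math.sqrt(s11)
        l21 = s12 / l11
        l22 = math.sqrt(max(s22 - l21 * l21, 0.0))
        g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        return m1 + l11 * g1, m2 + l21 * g1 + l22 * g2

    def select(self, candidates, rng):
        """Selection: sample theta, then pick the label pair maximizing f."""
        t1, t2 = self.sample_theta(rng)
        return max(candidates, key=lambda xy: t1 * xy[0] + t2 * xy[1])

    def learn(self, x, y, z):
        """Learning: propagate one observed payoff z back to theta."""
        v = self.noise_variance
        self.a += x * x / v
        self.b += x * y / v
        self.c += y * y / v
        self.q1 += x * z / v
        self.q2 += y * z / v

def simulate(true_theta=(1.0, -0.5), rounds=500, seed=0):
    """Play against a simulated payoff with a made-up true theta."""
    rng = random.Random(seed)
    grid = [(i / 4, j / 4) for i in range(5) for j in range(5)]
    bandit = LinearBandit()
    choices = []
    for _ in range(rounds):
        x, y = bandit.select(grid, rng)
        z = true_theta[0] * x + true_theta[1] * y + rng.gauss(0.0, 0.1)
        bandit.learn(x, y, z)
        choices.append((x, y))
    return bandit, choices
```

In this simulation the best label pair is (1.0, 0.0), and the sampled choices should concentrate there as the posterior over θ tightens. A grid stands in for the infinite set of bandits purely to keep the sketch short.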


Context Variables

[Figure: the Bogus Dog Food page with design elements labeled x1, x2 and x3, together with the context variables user.geo, env.time, env.day_of_week and env.weekend]


Caveats

 The original Bayesian Bandit requires only real-time data

 Generalized Bandit may require access to long history for learning
– Pseudo online learning may be easier than true online

 Bandit variables can include content, time of day, day of week

 Context variables can include user id, user features

 Bandit × context variables provide the real power
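One way to read "bandit × context" is as interaction features: every bandit variable is crossed with every context variable so that the model can learn, for example, that one picture works better on weekends. A minimal sketch under that reading; the function name and the feature names other than env.weekend and user.geo are hypothetical:

```python
from itertools import product

def cross_features(bandit, context):
    """Return bandit, context, and bandit-by-context interaction features.

    Both arguments are dicts mapping a feature name to a numeric value.
    """
    features = dict(bandit)
    features.update(context)
    for (b_name, b_value), (c_name, c_value) in product(bandit.items(), context.items()):
        features[f"{b_name}*{c_name}"] = b_value * c_value
    return features

# Hypothetical one-hot encoded variables.
bandit = {"picture=dog": 1.0, "call=buy_5": 1.0}
context = {"env.weekend": 1.0, "user.geo=US": 0.0}
features = cross_features(bandit, context)
print(sorted(features))
```

The number of features grows as the product of the two sets, so in practice the crossed features would probably need to be hashed or pruned.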


You can do this
yourself!


Super-fast k-means Clustering


Rationale


What is Quality?

 Robust clustering not a goal
– we don’t care if the same clustering is replicated
 Generalization is critical
 Agreement with a “gold standard” is a non-issue


An Example


Diagonalized Cluster Proximity


Clusters as Distribution Surrogate


Theory


For Example

$$\Delta_4^2(X) > \frac{1}{\sigma^2}\,\Delta_5^2(X)$$

Grouping these two clusters seriously hurts squared distance


Algorithms


Typical k-means Failure

Selecting two seeds here cannot be fixed with Lloyd’s algorithm

Result is that these two clusters get glued together


Ball k-means

 Provably better for highly clusterable data
 Tries to find initial centroids in the “core” of each real cluster
 Avoids outliers in centroid computation

initialize centroids randomly with distance maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer to the centroid than the closest other cluster is
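The pseudocode above can be sketched in plain Python. Two details are assumptions: the seeding uses k-means++-style sampling (points far from existing seeds are more likely to be chosen), and the "ball" around each centroid has a radius equal to a fixed fraction `trim` of the distance to the closest other centroid. This is not Mahout's implementation.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def seed_centroids(points, k, rng):
    """Pick k seeds; far-away points are more likely to be chosen.

    Assumes the data contains at least k distinct points.
    """
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centroids

def ball_kmeans(points, k, iterations=3, trim=0.5, seed=0):
    """Ball k-means sketch; assumes k >= 2."""
    rng = random.Random(seed)
    centroids = seed_centroids(points, k, rng)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        members = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            members[nearest].append(p)
        # Recompute each centroid from the points inside its ball only.
        updated = []
        for j in range(k):
            gap2 = min(sq_dist(centroids[j], centroids[m])
                       for m in range(k) if m != j)
            ball = [p for p in members[j]
                    if sq_dist(p, centroids[j]) <= trim * trim * gap2]
            if ball:
                updated.append([sum(col) / len(ball) for col in zip(*ball)])
            else:
                updated.append(centroids[j])
        centroids = updated
    return centroids

points = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1),
          (100, 100), (100, 100.1), (100.1, 100), (100.1, 100.1)]
print(sorted(ball_kmeans(points, 2)))
```

Points outside the ball still belong to the cluster; they just do not pull the centroid around, which is what keeps outliers from distorting it.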


Still Not a Win

 Ball k-means is nearly guaranteed to succeed with k = 2
 Probability of successful seeding drops exponentially with k
 Alternative strategy has high probability of success, but takes
O(nkd + k³d) time

 But for big data, k gets large


Surrogate Method

 Start with sloppy clustering into lots of clusters
κ = k log n clusters
 Use this sketch as a weighted surrogate for the data
 Results are provably good for highly clusterable data


Algorithm Costs

 Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast, sloppy search for nearest cluster,
O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids
O(κ k d + k³ d) = O(k² d log n + k³ d) for small k, high quality
O(κ d log k) or O(d log k (log k + log log n)) for larger k, looser quality
– result is k high-quality centroids
• For many purposes, even the sloppy surrogate may suffice


Algorithm Costs

 How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– k d log n = 2000 x 10 x 26 = 520,000 ≈ 500,000
– d (log k + log log n) = 10(11 + 5) = 160
– 3,000 times faster is a bona fide big deal
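The comparison above can be reproduced in a few lines. The sketch assumes base-2 logarithms, which the slide does not state. The rounded values used above (log n ≈ 26, log log n ≈ 5) match base-2 logarithms for n near 100,000,000 rather than 100,000, so both values of n are shown; the speedup is in the thousands either way.

```python
import math

def sketch_costs(k, d, n):
    """Per-point cost of finding the nearest of kappa = k log n centroids.

    Logs are base 2 here, which is an assumption.
    Returns (brute force cost, fast search cost, speedup).
    """
    log_n = math.log2(n)
    brute_force = k * d * log_n                            # d * kappa
    fast_search = d * (math.log2(k) + math.log2(log_n))    # d * log kappa
    return brute_force, fast_search, brute_force / fast_search

for n in (100_000, 100_000_000):
    brute, fast, speedup = sketch_costs(k=2000, d=10, n=n)
    print(f"n = {n:,}: {brute:,.0f} vs {fast:,.0f}, about {speedup:,.0f}x")
```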


How It Works

 For each point
– Find approximately nearest centroid (distance = d)
– If (d > threshold) new centroid
– Else if (u < d/threshold), with u uniform on [0, 1], new centroid
– Else add to nearest centroid
 If number of centroids > κ ≈ C log N
– Recursively cluster centroids with higher threshold
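A minimal single-pass sketch of the procedure above. It assumes that a point starts a new centroid with probability d/threshold, so farther points are more likely to start one, and it uses a brute-force nearest-centroid search where the real thing would use an approximate one. The function names and the growth factor for the threshold are hypothetical.

```python
import math
import random

def add_point(centroids, point, weight, threshold, rng):
    """Merge a weighted point into the nearest centroid or start a new one."""
    if centroids:
        nearest = min(centroids, key=lambda c: math.dist(c[1], point))
        distance = math.dist(nearest[1], point)
        # Far points always start a centroid; nearer ones do so with
        # probability distance / threshold. Otherwise they are merged.
        if distance <= threshold and rng.random() >= distance / threshold:
            total = nearest[0] + weight
            nearest[1] = [(nearest[0] * c + weight * p) / total
                          for c, p in zip(nearest[1], point)]
            nearest[0] = total
            return
    centroids.append([weight, list(point)])

def streaming_sketch(points, max_centroids, threshold=1.0, growth=1.5, seed=0):
    """Return a list of [weight, centroid] pairs; assumes max_centroids >= 1."""
    rng = random.Random(seed)
    centroids = []
    for point in points:
        add_point(centroids, point, 1.0, threshold, rng)
        while len(centroids) > max_centroids:
            # Too many centroids: raise the threshold and re-cluster them.
            threshold *= growth
            old, centroids = centroids, []
            for weight, centroid in old:
                add_point(centroids, centroid, weight, threshold, rng)
    return centroids

data_rng = random.Random(42)
points = [(data_rng.gauss(cx, 0.1), data_rng.gauss(cy, 0.1))
          for cx, cy in [(0, 0), (50, 50)] for _ in range(100)]
sketch = streaming_sketch(points, max_centroids=10)
print(len(sketch), sum(weight for weight, _ in sketch))
```

The weights record how many original points each centroid stands for, which is what lets the sketch serve as a weighted surrogate for the data in a later, higher-quality clustering step.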


Implementation


But Wait, …

 Finding nearest centroid is inner loop

 This could take O( d κ ) per point and κ can be big

 Happily, approximate nearest centroid works fine


Projection Search

total ordering!
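The slide gives only the key idea: projecting vectors onto a line yields a total ordering. A minimal sketch of one way to exploit that for approximate nearest-neighbor search, assuming a single random projection and a fixed window of candidates around the query's position in the ordering. The class name and the window size are hypothetical, and this is not necessarily how Mahout implements it.

```python
import bisect
import random

class ProjectionSearch:
    """Approximate nearest-neighbor search using one random projection."""

    def __init__(self, vectors, seed=0, window=4):
        rng = random.Random(seed)
        dim = len(vectors[0])
        self.direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        self.window = window
        # Sorting by the projected value gives a total ordering.
        self.entries = sorted((self._project(v), tuple(v)) for v in vectors)
        self.keys = [key for key, _ in self.entries]

    def _project(self, v):
        return sum(a * b for a, b in zip(self.direction, v))

    def nearest(self, query):
        """Return the closest vector among those near the query in the ordering."""
        pos = bisect.bisect_left(self.keys, self._project(query))
        lo = max(0, pos - self.window)
        hi = min(len(self.entries), pos + self.window)
        candidates = [v for _, v in self.entries[lo:hi]]
        return min(candidates,
                   key=lambda v: sum((a - b) ** 2 for a, b in zip(v, query)))

rng = random.Random(7)
vectors = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(100)]
search = ProjectionSearch(vectors)
print(search.nearest([0.1, -0.2, 0.3]))
```

Vectors that are close in space are close in projection, but not the other way round, so the window trades accuracy for speed; several independent projections would probably be combined to reduce misses.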


LSH Bit-match Versus Cosine
[Figure: cosine (y axis, -1 to 1) versus number of matching LSH bits (x axis, 0 to 64)]


Results


Parallel Speedup?

[Figure: time per point (μs, 10 to 200) versus number of threads (1 to 20), comparing the non-threaded version, the threaded version, and perfect scaling]

Quality

 Ball k-means implementation appears significantly better than
simple k-means

 Streaming k-means + ball k-means appears to be about as good as
ball k-means alone

 All evaluations on 20 newsgroups with held-out data

 Figure of merit is mean and median squared distance to nearest
cluster


Contact Me!

 We’re hiring at MapR in US and Europe

 MapR software available for research use

 Get the code as part of Mahout trunk (or 0.8 very soon)

 Contact me at tdunning@maprtech.com or @ted_dunning

 Share news with @apachemahout

