Strata new-york-2012

Online Learning
Bayesian bandits and more

©MapR Technologies - Confidential

whoami – Ted Dunning

• Ted Dunning
tdunning@maprtech.com
tdunning@apache.org
@ted_dunning

• We’re hiring at MapR

• For slides and other info
http://www.slideshare.net/tdunning


Online
Scalable
Incremental

Scalability and Learning

• What does scalable mean?

• What are inherent characteristics of scalable learning?

• What are the logical implications?


Scalable ≈ On-line
If you squint just right


unit of work ≈ unit of time


[Diagram with three elements: Infinite Data Stream, Learning, State]
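
The stream-and-state picture can be sketched as a learner that sees each point once and keeps only a small, fixed-size state. The example below is a minimal illustration, not something taken from the talk: it uses Welford's running mean and variance as the "learning" step, and the names `RunningStats` and `learn` are hypothetical.

```python
class RunningStats:
    """Constant-size state, updated once per data point (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; defined as 0.0 here when fewer than two points seen.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def learn(stream):
    """Consume a (possibly unbounded) iterator one point at a time."""
    state = RunningStats()
    for x in stream:
        state.update(x)  # one unit of work per arriving point
    return state


stats = learn(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(stats.n, stats.mean, stats.variance())
```

Because the state never grows with the stream, the work per point stays constant, which is the sense in which a unit of work lines up with a unit of time.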


Pick One


Now pick again


A Quick Diversion

• You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
• I flip the coin and while it is in the air ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
– And did it ever have a single value?


Which One to Play?

• One may be better than the other
• The better coin pays off at some rate
• Playing the other will pay off at a lesser rate
– Playing the lesser coin has “opportunity cost”

• But how do we know which is which?
– Explore versus Exploit!
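
One common way to balance exploring and exploiting is a Bayesian bandit that samples from a posterior for each coin and plays the coin whose sample is largest (often called Thompson sampling). The sketch below assumes a uniform Beta(1, 1) prior and two hypothetical payoff rates; the function name `thompson_sampling` and all numbers are illustrative, not taken from the talk.

```python
import random


def thompson_sampling(true_rates, n_plays, seed=0):
    """Play n_plays rounds; return per-coin win and loss counts."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(n_plays):
        # Draw one plausible payoff rate per coin from its Beta posterior.
        # The uniform Beta(1, 1) prior is an assumption of this sketch.
        samples = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
        coin = samples.index(max(samples))
        # Simulate the payoff of the chosen coin.
        if rng.random() < true_rates[coin]:
            wins[coin] += 1
        else:
            losses[coin] += 1
    return wins, losses


# Hypothetical coins: the second pays off more often than the first.
wins, losses = thompson_sampling([0.3, 0.7], n_plays=2000)
plays = [w + l for w, l in zip(wins, losses)]
print(plays)
```

Early on the posteriors are wide, so both coins get played; as evidence accumulates, the lesser coin is sampled less and less often, which keeps the opportunity cost small.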


A First Conclusion

• Probability as expressed by humans is subjective and depends on information and experience


A Second Conclusion

• A single number is a bad way to express uncertain knowledge

• A distribution of values might be better
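
As a rough illustration of a distribution standing in for a single number, a Beta distribution over the probability of heads can be summarized by its mean and spread. The sketch assumes a uniform prior and treats the counts as heads and tails; the counts themselves and the helper name `beta_summary` are illustrative.

```python
import math


def beta_summary(heads, tails):
    """Mean and standard deviation of the Beta posterior for P(heads)."""
    a, b = heads + 1, tails + 1  # uniform Beta(1, 1) prior (an assumption)
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)


# No data at all, then two illustrative sets of observations.
for heads, tails in [(0, 0), (5, 5), (2, 10)]:
    mean, sd = beta_summary(heads, tails)
    print(f"{heads} heads, {tails} tails: mean={mean:.3f}, sd={sd:.3f}")
```

With no observations the distribution is flat, and with more observations it narrows, so the same mean can carry very different amounts of certainty.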


I Dunno


5 and 5


2 and 10


The Cynic Among Us


Demo


An Example


The Cluster Proximity Features

• Every point can be described by the nearest cluster
– 4.3 bits per point in this case
– Significant error that can be decreased (to a point) by increasing the number of clusters
• Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)
– Error is negligible
– Unwinds the data into a simple representation
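
A minimal sketch of the two-nearest-cluster description, assuming Euclidean distance and a small set of hypothetical centroids. The sign bit mentioned above is not modeled here, and `proximity_features` is an illustrative name.

```python
import math


def proximity_features(point, centroids):
    """Return (nearest id, second-nearest id, distance 1, distance 2)."""
    ranked = sorted((math.dist(point, c), i) for i, c in enumerate(centroids))
    (d1, i1), (d2, i2) = ranked[0], ranked[1]
    return i1, i2, d1, d2


# Hypothetical centroids in two dimensions.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
features = proximity_features((1.0, 0.0), centroids)
print(features)

# Bits needed to name one cluster out of m is log2(m); 4.3 bits is roughly
# what 20 clusters would need (the count of 20 is inferred, not stated).
print(round(math.log2(20), 1))
```

Keeping only the nearest cluster id throws away where the point sits inside that cluster; adding the second cluster and the two distances recovers most of that position.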


Diagonalized Cluster Proximity


Lots of Clusters Are Fine


Surrogate Method

• Start with sloppy clustering into κ = k log n clusters
• Use these clusters as a weighted surrogate for the data
• Cluster surrogate data using ball k-means

• Results are provably high quality for highly clusterable data
• Sloppy clustering can be done on-line
• Surrogate can be kept in memory
• Ball k-means pass can be done at any time
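
The two-stage idea can be sketched roughly as follows. Several simplifications here are assumptions and not the method from the talk: a point starts a new cluster with probability proportional to its distance from the nearest centroid (using a fixed, hypothetical `threshold`), the target of κ = k log n clusters is not enforced, and plain weighted Lloyd iterations stand in for ball k-means.

```python
import math
import random


def sloppy_cluster(stream, threshold, rng):
    """Single pass over the stream; returns weighted centroids (the surrogate)."""
    centroids, weights = [], []
    for p in stream:
        if centroids:
            d, i = min((math.dist(p, c), j) for j, c in enumerate(centroids))
        else:
            d, i = math.inf, -1
        # Far-away points are likely to start a new cluster (assumed rule).
        if rng.random() < d / threshold:
            centroids.append(tuple(p))
            weights.append(1)
        else:
            # Otherwise fold the point into its nearest centroid.
            w = weights[i]
            centroids[i] = tuple(
                (w * c + x) / (w + 1) for c, x in zip(centroids[i], p)
            )
            weights[i] = w + 1
    return centroids, weights


def weighted_kmeans(points, weights, k, iterations, rng):
    """Plain weighted Lloyd iterations, standing in for ball k-means."""
    centers = rng.sample(points, k)
    dim = len(points[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            totals[j] += w
            for axis, x in enumerate(p):
                sums[j][axis] += w * x
        # Keep the old center if a cluster received no weight.
        centers = [
            tuple(s / totals[j] for s in sums[j]) if totals[j] > 0 else centers[j]
            for j in range(k)
        ]
    return centers


rng = random.Random(42)
# Hypothetical data: two well-separated blobs, shuffled into one stream.
data = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(500)]
data += [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(500)]
rng.shuffle(data)

surrogate, weights = sloppy_cluster(data, threshold=20.0, rng=rng)
centers = sorted(weighted_kmeans(surrogate, weights, k=2, iterations=10, rng=rng))
print(len(surrogate), centers)
```

The first pass touches each point once and keeps only the weighted centroids, so the second pass works on the small in-memory surrogate and can be rerun whenever a fresh clustering is wanted.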


Algorithm Costs

• O(k d log n) per point for Lloyd’s algorithm
… not so good for k = 2000, n = 10^8

• Surrogate methods
… O(d log κ) = O(d (log k + log log n)) per point

• This is a big deal:
– k d log n = 2000 x 10 x 26 ≈ 500,000
– log k + log log n ≈ 11 + 5 = 16
– 30,000 times faster makes the grade as a bona fide big deal
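
The arithmetic can be checked with a few lines, assuming base-2 logarithms and the values k = 2000, d = 10, n = 10^8. The quoted ratio compares k d log n against log k + log log n, leaving the factor d out of the surrogate side, so the sketch prints that ratio and also the ratio with d included.

```python
import math

k, d, n = 2000, 10, 10**8

log_n = math.log2(n)               # about 26.6
lloyd_cost = k * d * log_n         # per-point cost quoted for Lloyd's algorithm
kappa = k * log_n                  # number of surrogate clusters, k log n
log_kappa = math.log2(kappa)       # equals log k + log log n

ratio_without_d = lloyd_cost / log_kappa
ratio_with_d = lloyd_cost / (d * log_kappa)

print(f"k d log n           = {lloyd_cost:,.0f}")
print(f"log k + log log n   = {log_kappa:.1f}")
print(f"ratio (d omitted)   = {ratio_without_d:,.0f}")
print(f"ratio (d included)  = {ratio_with_d:,.0f}")
```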


30,000 times faster sounds good



but that isn’t the big news



these algorithms do
on-line clustering


Parallel Speedup?

[Figure: time per point (μs) versus number of threads (1 to 20), comparing the non-threaded version, the threaded version, and a perfect-scaling line]

What about deployment?


[Diagram with three elements: Infinite Data Stream, Learning, State]


[Diagram with three elements: Data Split, Mapper, State]


[Diagram: Data Split, three Mappers, and State, annotated “Need shared memory!”]
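
A small sketch of why several mappers need shared state: each worker processes its own split but updates one common model. Threads in a single process and a lock-protected running total are stand-ins chosen for illustration; the names `SharedState` and `mapper` are hypothetical, and a real deployment would need shared memory across processes or machines.

```python
import threading


class SharedState:
    """One model state updated by every mapper, guarded by a lock."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self._lock = threading.Lock()

    def update(self, x):
        with self._lock:  # serialize updates so none are lost
            self.count += 1
            self.total += x


def mapper(split, state):
    """Process one split of the data, updating the shared state per point."""
    for x in split:
        state.update(x)


# Hypothetical splits of the input data.
splits = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
state = SharedState()
threads = [threading.Thread(target=mapper, args=(s, state)) for s in splits]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(state.count, state.total / state.count)
```

If each mapper kept a private copy instead, no mapper would learn from the points the others saw, which is the gap the shared-memory annotation points at.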

