News From Mahout


©MapR Technologies - Confidential   1
whoami – Ted Dunning

     Chief Application Architect, MapR Technologies
     Committer, member, Apache Software Foundation
       –   particularly Mahout, Zookeeper and Drill


                       (we’re hiring)

     Contact me at
       tdunning@maprtech.com
       tdunning@apache.org
       ted.dunning@gmail.com
       @ted_dunning



     Slides and such (available late tonight):
       –   http://www.mapr.com/company/events/nyhug-03-05-2013



     Hash tags: #mapr #nyhug #mahout




New in Mahout

     0.8 is coming soon (1-2 months)
     gobs of fixes
     QR decomposition is 10x faster
       –   makes ALS 2-3 times faster
     May include Bayesian Bandits
     Super fast k-means
       –   fast
       –   online (!?!)
       –   fast
     Possible new edition of MiA coming
       –   Japanese and Korean editions released, Chinese coming


Real-time Learning

We have a product to sell from a web-site

What picture?
What tag-line?
     Bogus Dog Food is the Best!
     Now available in handy 1 ton bags!
What call to action?
     Buy 5!

The Challenge

     Design decisions affect probability of success
       –   Cheesy web-sites don’t even sell cheese



     The best designers do better when allowed to fail
       –   Exploration juices creativity




     But failing is expensive
       –   If only because we could have succeeded
       –   But also because offending or disappointing customers is bad


More Challenges

     Too many designs
       – 5 pictures
       – 10 tag-lines
       – 4 calls to action
       – 3 back-ground colors
       => 5 x 10 x 4 x 3 = 600 designs


     It gets worse quickly
       –   What about changes on the back-end?
       –   Search engine variants?
       –   Checkout process variants?
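
The 600-design count above is the size of a Cartesian product. A minimal sketch, assuming placeholder option names (only the counts 5, 10, 4 and 3 come from the slide):

```python
from itertools import product

# Hypothetical option labels; only the counts come from the slide.
pictures = [f"picture-{i}" for i in range(5)]
tag_lines = [f"tag-line-{i}" for i in range(10)]
calls_to_action = [f"call-{i}" for i in range(4)]
backgrounds = [f"background-{i}" for i in range(3)]

# Every combination of one picture, one tag-line, one call to action, one color.
designs = list(product(pictures, tag_lines, calls_to_action, backgrounds))
print(len(designs))  # 5 * 10 * 4 * 3 = 600
```

Each extra design dimension multiplies this count, which is why testing every design with equal traffic stops being practical.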




Example – AB testing in real-time

     I have 15 versions of my landing page
     Each visitor is assigned to a version
       –   Which version?
     A conversion or sale or whatever can happen
       –   How long to wait?
     Some versions of the landing page are horrible
       –   Don’t want to give them traffic




A Quick Diversion

     You see a coin
       –   What is the probability of heads?
       –   Could it be larger or smaller than that?
     I flip the coin and while it is in the air ask again
     I catch the coin and ask again
     I look at the coin (and you don’t) and ask again
     Why does the answer change?
       –   And did it ever have a single value?




A Philosophical Conclusion

     Probability as expressed by humans is subjective and depends on
      information and experience




I Dunno




5 heads out of 10 throws




2 heads out of 12 throws




So now you understand Bayesian probability
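
The coin examples can be made concrete with a Beta posterior. This is a minimal sketch that assumes a uniform prior over the probability of heads (the same +1 convention that appears in the R code on the later "The Code" slide); the function name `beta_posterior` is illustrative, not from the talk.

```python
import math

def beta_posterior(heads, throws):
    """Mean and standard deviation of Beta(heads + 1, tails + 1), the
    posterior for P(heads) under an assumed uniform prior."""
    a = heads + 1
    b = (throws - heads) + 1
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

# No data ("I dunno"), then the two cases from the slides.
for heads, throws in [(0, 0), (5, 10), (2, 12)]:
    mean, std = beta_posterior(heads, throws)
    print(f"{heads} heads / {throws} throws: mean {mean:.3f}, std {std:.3f}")
```

The spread shrinks as throws accumulate, which is the sense in which the stated probability depends on information and experience.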



Another Quick Diversion

     Let’s play a shell game
     This is a special shell game
     It costs you nothing to play
     The pea has constant probability of being under each shell
           (trust me)
     How do you find the best shell?
     How do you find it while maximizing the number of wins?




Pause for short con-game

Interim Thoughts

     Can you identify winners or losers without trying them out?


     Can you ever completely eliminate a shell with a bad streak?


     Should you keep trying apparent losers?




So now you understand multi-armed bandits

Conclusions

     Can you identify winners or losers without trying them out?
       No


     Can you ever completely eliminate a shell with a bad streak?
       No


     Should you keep trying apparent losers?
       Yes, but at a decreasing rate




Is there an optimum strategy?

Bayesian Bandit

     Compute distributions based on data so far
     Sample p1, p2 and p3 from these distributions
     Pick shell i where i = argmax_i p_i


     Lemma 1: The probability of picking shell i will match the
      probability it is the best shell


     Lemma 2: This is as good as it gets
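
The three steps above can be sketched as a small simulation. This is a hedged illustration, not the speaker's implementation: it assumes win/lose payoffs, a uniform Beta(1, 1) prior per shell, and made-up true win rates; `select` and `play` are illustrative names.

```python
import random

def select(counts, rng):
    """Sample a win rate for every shell from its Beta posterior (uniform
    prior assumed) and return the index of the largest sample."""
    samples = [rng.betavariate(wins + 1, losses + 1) for wins, losses in counts]
    return max(range(len(samples)), key=samples.__getitem__)

def play(true_rates, steps, seed=0):
    """Run the select-and-learn loop against simulated shells."""
    rng = random.Random(seed)
    counts = [[0, 0] for _ in true_rates]  # [wins, losses] per shell
    for _ in range(steps):
        i = select(counts, rng)
        won = rng.random() < true_rates[i]
        counts[i][0 if won else 1] += 1
    return counts

# Hypothetical win rates; the third shell is the best one.
counts = play([0.05, 0.10, 0.30], steps=5000)
for i, (wins, losses) in enumerate(counts):
    print(f"shell {i}: {wins + losses} plays, {wins} wins")
```

Because a shell is picked exactly when its sample is the largest, weak shells keep getting occasional plays, but less and less often as their posteriors tighten.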




And it works!


     [Figure: regret (0 to 0.12) plotted against n (0 to 1100), comparing ε-greedy with ε = 0.05 and the Bayesian Bandit with Gamma-Normal]

Video Demo




The Code

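     Select an alternative
                   # k is an n x 2 matrix of counts; column 2 holds wins and
                   # column 1 holds losses, as the rbeta arguments imply
                   select = function(k) {
                     n = dim(k)[1]
                     p0 = rep(0, length.out=n)
                     for (i in 1:n) {
                       p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
                     }
                     return (which.max(p0))
                   }


     Select and learn
                   # test(i) (not shown) tries alternative i and returns the
                   # column to increment: 2 for a win, 1 for a loss
                   learn = function(k, steps) {
                     for (z in 1:steps) {
                       i = select(k)
                       j = test(i)
                       k[i,j] = k[i,j]+1
                     }
                     return (k)
                   }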




     But we already know how to count!

The Basic Idea

     We can encode a distribution by sampling
     Sampling allows unification of exploration and exploitation


     Can be extended to more general response models




The Original Problem

     [Figure: the dog food page mock-up ("Bogus Dog Food is the Best! Now available in handy 1 ton bags!", "Buy 5!") with its three design choices labeled x1, x2 and x3]

Response Function


     $$p(\mathrm{win}) = w\left( \sum_i \theta_i x_i \right)$$

     [Figure: plot of y against x for x from -6 to 6, with y running from 0 to 1]
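
A response function of this shape can be sketched in a few lines. The slide does not name w; the logistic function is assumed here because it matches an S-shaped curve from 0 to 1, and the weights and design encoding below are hypothetical.

```python
import math

def p_win(theta, x):
    """p(win) = w(sum_i theta_i * x_i), with w assumed to be the logistic function."""
    s = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-s))

# Hypothetical weights and an indicator encoding of one design.
theta = [0.4, -1.2, 0.7]
x = [1.0, 0.0, 1.0]
print(p_win(theta, x))
```

With all weights at zero the function returns 0.5, the midpoint of the curve.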




Generalized Banditry

     Suppose we have an infinite number of bandits
       –   suppose they are each labeled by two real numbers x and y in [0,1]
       –   also that expected payoff is a parameterized function of x and y

                                    $$E[z] = f(x, y \mid \theta)$$
       –   now assume a distribution for θ that we can learn online
     Selection works by sampling θ, then computing f
     Learning works by propagating updates back to θ
       –   If f is linear, this is very easy
       –   For some other special kinds of f it isn’t too hard
     We don’t have to stop at two labels; we could have labels and context
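
For the linear case, selection and learning can be sketched with a Gaussian posterior over θ. Everything specific here is an assumption made for illustration: the response f(x, y | θ) = θ0 + θ1 x + θ2 y, a N(0, I) prior, unit observation noise in the model, a finite grid standing in for the infinite set of bandits, and the simulated true θ.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L L^T = a (a symmetric positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L v = b by forward substitution."""
    v = []
    for i in range(len(b)):
        v.append((b[i] - sum(L[i][k] * v[k] for k in range(i))) / L[i][i])
    return v

def solve_upper(L, b):
    """Solve L^T v = b by back substitution."""
    n = len(b)
    v = [0.0] * n
    for i in reversed(range(n)):
        v[i] = (b[i] - sum(L[k][i] * v[k] for k in range(i + 1, n))) / L[i][i]
    return v

def features(x, y):
    # assumed linear response: f(x, y | theta) = theta0 + theta1*x + theta2*y
    return [1.0, x, y]

class LinearBandit:
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        # posterior precision and moment vector; starts at the N(0, I) prior
        self.precision = [[float(i == j) for j in range(3)] for i in range(3)]
        self.moment = [0.0, 0.0, 0.0]

    def posterior_mean(self):
        L = cholesky(self.precision)
        return solve_upper(L, solve_lower(L, self.moment))

    def sample_theta(self):
        L = cholesky(self.precision)
        mean = solve_upper(L, solve_lower(L, self.moment))
        # L^-T times standard normal noise has covariance precision^-1
        noise = solve_upper(L, [self.rng.gauss(0, 1) for _ in range(3)])
        return [m + e for m, e in zip(mean, noise)]

    def select(self, candidates):
        """Sample theta, then pick the candidate with the largest f."""
        theta = self.sample_theta()
        return max(candidates,
                   key=lambda c: sum(t * f for t, f in zip(theta, features(*c))))

    def learn(self, x, y, z):
        """Propagate one observed payoff z back into the posterior."""
        phi = features(x, y)
        for i in range(3):
            self.moment[i] += phi[i] * z
            for j in range(3):
                self.precision[i][j] += phi[i] * phi[j]

# Simulated use: a hypothetical true theta and a 5 x 5 grid of (x, y) labels.
true_theta = [0.1, 0.5, -0.3]
grid = [(i / 4, j / 4) for i in range(5) for j in range(5)]
bandit = LinearBandit()
noise = random.Random(1)
for _ in range(500):
    x, y = bandit.select(grid)
    z = sum(t * f for t, f in zip(true_theta, features(x, y))) + noise.gauss(0, 0.1)
    bandit.learn(x, y, z)
print(bandit.posterior_mean())
```

The update is just two running sums, which is why the linear case counts as easy; context variables would enter as extra entries in `features`.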



Context Variables

     [Figure: the same dog food page mock-up with design variables x1, x2 and x3, plus the context variables user.geo, env.time, env.day_of_week and env.weekend]

Caveats

     Original Bayesian Bandit only requires real-time


     Generalized Bandit may require access to long history for learning
       –   Pseudo online learning may be easier than true online


     Bandit variables can include content, time of day, day of week


     Context variables can include user id, user features


     Bandit × context variables provide the real power


You can do this yourself!

Super-fast k-means Clustering

Rationale

What is Quality?

     Robust clustering not a goal
       –   we don’t care if the same clustering is replicated
     Generalization is critical
     Agreement to “gold standard” is a non-issue




An Example




Diagonalized Cluster Proximity




Clusters as Distribution Surrogate




Theory


For Example


                                                   1
                                    D (X) >
                                     2
                                                         D (X)
                                                          2

                                                 s
                                     4               2    5




                                               Grouping these
                                                  two clusters
                                                seriously hurts
                                              squared distance




Algorithms


Typical k-means Failure


     Selecting two seeds here cannot be fixed with Lloyd's algorithm

     Result is that these two clusters get glued together




Ball k-means

     Provably better for highly clusterable data
     Tries to find initial centroids in the “core” of each real cluster
     Avoids outliers in centroid computation

       initialize centroids randomly with distance maximizing tendency
       for each of a very few iterations:
         for each data point:
               assign point to nearest cluster
         recompute centroids using only points much closer to their own centroid than to the closest other cluster
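
The pseudocode above can be sketched as follows. Two choices are assumptions, not taken from the slide: seeding draws each new seed with probability proportional to squared distance from the seeds so far (a k-means++-style reading of "distance maximizing tendency"), and "much closer" is taken to mean within `trim` times the distance to the nearest other centroid. The sketch needs k >= 2.

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def seed(points, k, rng):
    """Pick seeds one at a time, favoring points far from the seeds so far."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(dist2(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def ball_kmeans(points, k, iterations=3, trim=0.5, seed_value=0):
    rng = random.Random(seed_value)
    centroids = seed(points, k, rng)
    for _ in range(iterations):
        members = [[] for _ in centroids]
        for p in points:
            d = [dist2(p, c) for c in centroids]
            i = min(range(k), key=d.__getitem__)
            # squared distance from centroid i to its nearest other centroid
            gap = min(dist2(centroids[i], c)
                      for j, c in enumerate(centroids) if j != i)
            # keep the point only if it lies well inside the ball around centroid i
            if d[i] <= (trim ** 2) * gap:
                members[i].append(p)
        # centroids with no surviving members are left where they were
        centroids = [
            [sum(col) / len(m) for col in zip(*m)] if m else c
            for m, c in zip(members, centroids)
        ]
    return centroids

# Two synthetic, well separated blobs.
rng = random.Random(1)
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(100)]
blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(100)]
print(ball_kmeans(blob_a + blob_b, k=2))
```

Points that fall outside every ball simply do not vote, which is how outliers are kept out of the centroid computation.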




Still Not a Win

     Ball k-means is nearly guaranteed with k = 2
     Probability of successful seeding drops exponentially with k
     Alternative strategy has high probability of success, but takes
      O(nkd + k^3 d) time


     But for big data, k gets large




Surrogate Method

     Start with sloppy clustering into lots of clusters
                       κ = k log n clusters
     Use this sketch as a weighted surrogate for the data
     Results are provably good for highly clusterable data




Algorithm Costs

     Surrogate methods
       –   fast, sloppy single pass clustering with κ = k log n
       –   fast sloppy search for nearest cluster,
                O(d log κ) = O(d (log k + log log n)) per point
       –   fast, in-memory, high-quality clustering of κ weighted centroids
                       O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
                       O(κ d log k) or O(d log k (log k + log log n)) for larger k, looser quality
       –   result is k high-quality centroids
            •    For many purposes, even the sloppy surrogate may suffice




Algorithm Costs

     How much faster for the sketch phase?
       –   take k = 2000, d = 10, n = 100,000
       –   k d log n = 2000 x 10 x 26 = 520,000, or roughly 500,000
       –   d (log k + log log n) = 10(11 + 5) = 160
       –   3,000 times faster is a bona fide big deal




How It Works

     For each point
       –   Find approximately nearest centroid (distance = d)
       –   If (d > threshold) new centroid
       –   Else if (u < d/threshold) new centroid, with u drawn uniformly from [0, 1]
       –   Else add to nearest centroid
     If number of centroids > κ ≈ C log N
       –   Recursively cluster centroids with higher threshold
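
The loop above can be sketched as follows, reading the random step as "start a new centroid with probability d/threshold". Assumptions made for illustration: exact nearest-centroid search (the talk uses an approximate one), a fixed growth factor of 1.5 for the threshold, and the names `add_point` and `sketch`. The threshold must be positive.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def add_point(centroids, point, weight, threshold, rng):
    """Fold one weighted point into the list of [vector, weight] centroids."""
    if centroids:
        nearest = min(centroids, key=lambda c: dist(c[0], point))
        d = dist(nearest[0], point)
        # far points always start a new centroid; nearer ones do so with
        # probability d / threshold, otherwise they are merged
        if d <= threshold and rng.random() >= d / threshold:
            total = nearest[1] + weight
            nearest[0] = [(c * nearest[1] + x * weight) / total
                          for c, x in zip(nearest[0], point)]
            nearest[1] = total
            return
    centroids.append([list(point), weight])

def sketch(points, threshold, max_centroids, seed=0):
    rng = random.Random(seed)
    centroids = []
    for p in points:
        add_point(centroids, p, 1.0, threshold, rng)
        while len(centroids) > max_centroids:
            # too many centroids: raise the threshold and re-cluster them
            threshold *= 1.5
            old, centroids = centroids, []
            for vector, weight in old:
                add_point(centroids, vector, weight, threshold, rng)
    return centroids

# Synthetic data: 1000 uniform points in the unit square.
data_rng = random.Random(42)
data = [(data_rng.random(), data_rng.random()) for _ in range(1000)]
result = sketch(data, threshold=0.01, max_centroids=50)
print(len(result), sum(w for _, w in result))
```

The weights record how many original points each centroid stands for, so the result can be handed to a weighted in-memory clustering step.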




Implementation


But Wait, 


     Finding nearest centroid is inner loop


     This could take O(d κ) per point and κ can be big


     Happily, approximate nearest centroid works fine




Projection Search


                                         total ordering!
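
One way to read this slide: projecting every vector onto a random direction yields a single number per vector, and numbers can be sorted. The sketch below assumes a single projection and a fixed search window, and the class and parameter names are illustrative; it is not claimed to match the Mahout implementation.

```python
import bisect
import math
import random

class ProjectionSearch:
    """Approximate nearest-neighbor search using one random projection."""

    def __init__(self, vectors, seed=0):
        rng = random.Random(seed)
        self.vectors = vectors
        self.direction = [rng.gauss(0, 1) for _ in range(len(vectors[0]))]
        # sorting by projected value gives a total ordering of the vectors
        self.entries = sorted((self._project(v), i) for i, v in enumerate(vectors))
        self.keys = [key for key, _ in self.entries]

    def _project(self, v):
        return sum(a * b for a, b in zip(self.direction, v))

    def nearest(self, query, window=8):
        """Check only the entries within `window` positions of the query."""
        pos = bisect.bisect_left(self.keys, self._project(query))
        lo = max(0, pos - window)
        hi = min(len(self.entries), pos + window)
        candidates = [i for _, i in self.entries[lo:hi]]
        return min(candidates, key=lambda i: math.dist(query, self.vectors[i]))

# Synthetic data: 200 random 5-dimensional vectors.
rng = random.Random(3)
vectors = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
index = ProjectionSearch(vectors)
print(index.nearest([0.0] * 5))
```

Each lookup costs a binary search plus a handful of full distance computations, instead of one distance computation per centroid.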




LSH Bit-match Versus Cosine
     [Figure: LSH bit-match count (X axis, 0 to 64) plotted against cosine (Y axis, -1 to 1)]
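
The relationship between matching bits and cosine can be sketched with random-hyperplane hashing, where two vectors at angle φ agree on a given bit with probability 1 - φ/π. The 64-bit signature length follows the axis range on the slide; the dimension, test vectors and function names are assumptions.

```python
import math
import random

def signature(v, planes):
    """One bit per random hyperplane: True if v falls on its non-negative side."""
    return [sum(a * b for a, b in zip(p, v)) >= 0 for p in planes]

def estimated_cosine(sig_a, sig_b):
    """Convert the fraction of matching bits into a cosine estimate."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return math.cos(math.pi * (1 - matches / len(sig_a)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

rng = random.Random(0)
dim, bits = 10, 64
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
a = [rng.gauss(0, 1) for _ in range(dim)]
b = [x + 0.5 * rng.gauss(0, 1) for x in a]
print(cosine(a, b), estimated_cosine(signature(a, planes), signature(b, planes)))
```

Comparing 64 bits is far cheaper than a full dot product, which is why the bit-match count is attractive as a first-pass filter.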




Results


Parallel Speedup?

     [Figure: time per point (μs) versus number of threads (1 to 20) for the non-threaded and threaded versions, with a perfect-scaling reference]

Quality

     Ball k-means implementation appears significantly better than
      simple k-means


     Streaming k-means + ball k-means appears to be about as good as
      ball k-means alone


     All evaluations on 20 newsgroups with held-out data


     Figure of merit is mean and median squared distance to nearest
      cluster
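
The figure of merit is simple to compute. A minimal sketch, assuming points and centroids are plain sequences of floats; the function names and the toy data are illustrative.

```python
import statistics

def squared_distance_to_nearest(point, centroids):
    return min(sum((x - c) ** 2 for x, c in zip(point, centroid))
               for centroid in centroids)

def figure_of_merit(held_out, centroids):
    """Mean and median squared distance from held-out points to the nearest centroid."""
    d2 = [squared_distance_to_nearest(p, centroids) for p in held_out]
    return statistics.mean(d2), statistics.median(d2)

# Toy data: two centroids and three held-out points.
centroids = [(0.0, 0.0), (10.0, 0.0)]
held_out = [(1.0, 0.0), (9.0, 0.0), (5.0, 0.0)]
print(figure_of_merit(held_out, centroids))
```

Reporting the median alongside the mean keeps a few far-away points from dominating the comparison.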


Contact Me!

     We’re hiring at MapR in US and Europe

     MapR software available for research use

     Get the code as part of Mahout trunk (or 0.8 very soon)

     Contact me at tdunning@maprtech.com or @ted_dunning

     Share news with @apachemahout




©MapR Technologies - Confidential    64

More Related Content

Similar to News from Mahout

News From Mahout
News From MahoutNews From Mahout
News From Mahout
MapR Technologies
 
Devoxx Real-Time Learning
Devoxx Real-Time LearningDevoxx Real-Time Learning
Devoxx Real-Time Learning
MapR Technologies
 
Cmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceCmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop Performance
Ted Dunning
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clustering
Ted Dunning
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time Together
MapR Technologies
 
Strata New York 2012
Strata New York 2012Strata New York 2012
Strata New York 2012
MapR Technologies
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for Mahout
Ted Dunning
 
Chicago finance-big-data
Chicago finance-big-dataChicago finance-big-data
Chicago finance-big-data
Ted Dunning
 
London data science
London data scienceLondon data science
London data science
Ted Dunning
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop Performance
MapR Technologies
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to New
MapR Technologies
 
Super-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRSuper-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapR
Data Science London
 
Big data, why now?
Big data, why now?Big data, why now?
Big data, why now?
Ted Dunning
 
New directions for mahout
New directions for mahoutNew directions for mahout
New directions for mahoutMapR Technologies
 
Buzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningBuzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learning
Ted Dunning
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012
MapR Technologies
 
Deep Learning for Fraud Detection
Deep Learning for Fraud DetectionDeep Learning for Fraud Detection
Deep Learning for Fraud Detection
DataWorks Summit/Hadoop Summit
 
How to find what you didn't know to look for, oractical anomaly detection
How to find what you didn't know to look for, oractical anomaly detectionHow to find what you didn't know to look for, oractical anomaly detection
How to find what you didn't know to look for, oractical anomaly detectionDataWorks Summit
 
Machine Learning - What, Where and How
Machine Learning - What, Where and HowMachine Learning - What, Where and How
Machine Learning - What, Where and Hownarinderk
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning Clustering
MapR Technologies
 

Similar to News from Mahout (20)

News From Mahout
News From MahoutNews From Mahout
News From Mahout
 
Devoxx Real-Time Learning
Devoxx Real-Time LearningDevoxx Real-Time Learning
Devoxx Real-Time Learning
 
Cmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceCmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop Performance
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clustering
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time Together
 
Strata New York 2012
Strata New York 2012Strata New York 2012
Strata New York 2012
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for Mahout
 
Chicago finance-big-data
Chicago finance-big-dataChicago finance-big-data
Chicago finance-big-data
 
London data science
London data scienceLondon data science
London data science
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop Performance
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to New
 
Super-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRSuper-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapR
 
Big data, why now?
Big data, why now?Big data, why now?
Big data, why now?
 
New directions for mahout
New directions for mahoutNew directions for mahout
New directions for mahout
 
Buzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningBuzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learning
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012
 
Deep Learning for Fraud Detection
Deep Learning for Fraud DetectionDeep Learning for Fraud Detection
Deep Learning for Fraud Detection
 
How to find what you didn't know to look for, oractical anomaly detection
How to find what you didn't know to look for, oractical anomaly detectionHow to find what you didn't know to look for, oractical anomaly detection
How to find what you didn't know to look for, oractical anomaly detection
 
Machine Learning - What, Where and How
Machine Learning - What, Where and HowMachine Learning - What, Where and How
Machine Learning - What, Where and How
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning Clustering
 

More from Ted Dunning

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
Ted Dunning
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
Ted Dunning
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
Ted Dunning
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
Ted Dunning
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
Ted Dunning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
Ted Dunning
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
Ted Dunning
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
Ted Dunning
 
T digest-update
T digest-updateT digest-update
T digest-update
Ted Dunning
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real Data
Ted Dunning
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
Ted Dunning
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
Ted Dunning
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015
Ted Dunning
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data Securely
Ted Dunning
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Ted Dunning
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
Ted Dunning
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
Ted Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
Ted Dunning
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
Ted Dunning
 

More from Ted Dunning (20)

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
Streaming Architecture including Rendezvous for Machine Learning



News from Mahout

  • 1. News From Mahout ©MapR Technologies - Confidential 1
  • 2. whoami – Ted Dunning
      • Chief Application Architect, MapR Technologies
      • Committer, member, Apache Software Foundation
          – particularly Mahout, Zookeeper and Drill
      • (we’re hiring)
      • Contact me at
          – tdunning@maprtech.com
          – tdunning@apache.com
          – ted.dunning@gmail.com
          – @ted_dunning
  • 3.
      • Slides and such (available late tonight):
          – http://www.mapr.com/company/events/nyhug-03-05-2013
      • Hash tags: #mapr #nyhug #mahout
  • 4. New in Mahout
      • 0.8 is coming soon (1-2 months)
      • gobs of fixes
      • QR decomposition is 10x faster
          – makes ALS 2-3 times faster
      • May include Bayesian Bandits
      • Super fast k-means
          – fast
          – online (!?!)
  • 5. New in Mahout
      • 0.8 is coming soon (1-2 months)
      • gobs of fixes
      • QR decomposition is 10x faster
          – makes ALS 2-3 times faster
      • May include Bayesian Bandits
      • Super fast k-means
          – fast
          – online (!?!)
          – fast
      • Possible new edition of MiA coming
          – Japanese and Korean editions released, Chinese coming
  • 8. We have a product to sell … from a web-site
  • 9. [Figure: mock ad reading “Bogus Dog Food is the Best! Now available in handy 1 ton bags! Buy 5!”, annotated with the questions “What picture?”, “What tag-line?” and “What call to action?”]
  • 10. The Challenge
      • Design decisions affect probability of success
          – Cheesy web-sites don’t even sell cheese
      • The best designers do better when allowed to fail
          – Exploration juices creativity
      • But failing is expensive
          – If only because we could have succeeded
          – But also because offending or disappointing customers is bad
  • 11. More Challenges
      • Too many designs
          – 5 pictures
          – 10 tag-lines
          – 4 calls to action
          – 3 back-ground colors
          => 5 x 10 x 4 x 3 = 600 designs
      • It gets worse quickly
          – What about changes on the back-end?
          – Search engine variants?
          – Checkout process variants?
  • 12. Example – AB testing in real-time
      • I have 15 versions of my landing page
      • Each visitor is assigned to a version
          – Which version?
      • A conversion or sale or whatever can happen
          – How long to wait?
      • Some versions of the landing page are horrible
          – Don’t want to give them traffic
  • 13. A Quick Diversion
      • You see a coin
          – What is the probability of heads?
          – Could it be larger or smaller than that?
      • I flip the coin and while it is in the air ask again
      • I catch the coin and ask again
      • I look at the coin (and you don’t) and ask again
      • Why does the answer change?
          – And did it ever have a single value?
  • 14. A Philosophical Conclusion
      • Probability as expressed by humans is subjective and depends on information and experience
  • 15. I Dunno
  • 16. 5 heads out of 10 throws
  • 17. 2 heads out of 12 throws
  • 18. So now you understand Bayesian probability
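The three coin slides can be made concrete with a Beta posterior over the probability of heads. This is a minimal sketch, assuming a uniform Beta(1, 1) prior (the slides do not state a prior; the +1 counts in the deck's later R code are consistent with one). The helper name `posterior_summary` is hypothetical.

```r
# Posterior over P(heads) after some coin flips, assuming a uniform
# Beta(1, 1) prior (an assumption, not something the slides state).
posterior_summary <- function(heads, throws) {
  a <- heads + 1             # prior pseudo-count plus observed heads
  b <- throws - heads + 1    # prior pseudo-count plus observed tails
  c(mean  = a / (a + b),
    lower = qbeta(0.025, a, b),   # 95% credible interval
    upper = qbeta(0.975, a, b))
}

no_data     <- posterior_summary(0, 0)    # "I dunno": flat over [0, 1]
five_of_ten <- posterior_summary(5, 10)
two_of_12   <- posterior_summary(2, 12)
print(round(rbind(no_data, five_of_ten, two_of_12), 3))
```

With no data the interval spans almost all of [0, 1]; each batch of throws narrows it and moves the mean toward the observed frequency.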
  • 19. Another Quick Diversion
      • Let’s play a shell game
      • This is a special shell game
      • It costs you nothing to play
      • The pea has constant probability of being under each shell (trust me)
      • How do you find the best shell?
      • How do you find it while maximizing the number of wins?
  • 20. Pause for short con-game
  • 21. Interim Thoughts
      • Can you identify winners or losers without trying them out?
      • Can you ever completely eliminate a shell with a bad streak?
      • Should you keep trying apparent losers?
  • 22. So now you understand multi-armed bandits
  • 23. Conclusions
      • Can you identify winners or losers without trying them out? No
      • Can you ever completely eliminate a shell with a bad streak? No
      • Should you keep trying apparent losers? Yes, but at a decreasing rate
  • 24. Is there an optimum strategy?
  • 25. Bayesian Bandit
      • Compute distributions based on data so far
      • Sample p_1, p_2 and p_3 from these distributions
      • Pick shell i where i = argmax_i p_i
      • Lemma 1: The probability of picking shell i will match the probability it is the best shell
      • Lemma 2: This is as good as it gets
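Lemma 1 can be checked numerically. The sketch below assumes Beta posteriors with a uniform prior and a made-up table of counts `k`; the function name `pick_shell` is hypothetical. Repeating the sample-and-argmax step many times shows how often each shell gets picked, which is by construction a Monte Carlo estimate of the posterior probability that the shell is the best one.

```r
set.seed(1)
# Hypothetical counts so far: one row per shell, columns = (losses, wins).
k <- matrix(c(8, 2,
              5, 5,
              3, 7), ncol = 2, byrow = TRUE)

# Draw one sample per shell from its Beta posterior and return the winner.
pick_shell <- function(k) {
  p <- rbeta(nrow(k), k[, 2] + 1, k[, 1] + 1)
  which.max(p)
}

# The pick frequencies estimate P(shell i is best | data so far).
picks <- replicate(10000, pick_shell(k))
pick_rate <- tabulate(picks, nbins = nrow(k)) / length(picks)
print(round(pick_rate, 3))
```

The apparent loser (shell 1) is still picked occasionally, which is the "keep trying apparent losers, but at a decreasing rate" behavior from the conclusions slide.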
  • 26. And it works! [Figure: regret (y axis, 0 to 0.12) versus n (x axis, 0 to 1100) for ε-greedy with ε = 0.05 and for Bayesian Bandit with Gamma-Normal]
  • 27. Video Demo
  • 28. The Code
      • Select an alternative (body of select(k))
            n = dim(k)[1]
            p0 = rep(0, length.out=n)
            for (i in 1:n) {
              # Beta(wins + 1, losses + 1): k[i,2] counts wins, k[i,1] counts losses
              p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
            }
            return (which(p0 == max(p0)))
      • Select and learn
            for (z in 1:steps) {
              i = select(k)
              j = test(i)          # column to increment for the observed outcome
              k[i,j] = k[i,j]+1
            }
            return (k)
      • But we already know how to count!
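The two fragments on this slide are function bodies, so they need a little scaffolding to run. The following is a sketch under stated assumptions: `true_p` and the `test` function are hypothetical stand-ins for a real experiment, and `learn` is a name chosen here for the slide's "select and learn" loop.

```r
set.seed(42)
# Hypothetical true win probabilities, unknown to the learner.
true_p <- c(0.02, 0.05, 0.10)

# Hypothetical stand-in for the slide's test(i): try alternative i once and
# return the column to increment (1 = loss, 2 = win).
test <- function(i) if (runif(1) < true_p[i]) 2 else 1

# The slide's selection step: sample each posterior, pick the largest sample.
select <- function(k) {
  which.max(rbeta(nrow(k), k[, 2] + 1, k[, 1] + 1))
}

# The slide's learning loop: the only state is a table of counts.
learn <- function(k, steps) {
  for (z in seq_len(steps)) {
    i <- select(k)
    j <- test(i)
    k[i, j] <- k[i, j] + 1
  }
  k
}

k <- learn(matrix(0, nrow = length(true_p), ncol = 2), steps = 5000)
trials <- rowSums(k)
print(cbind(losses = k[, 1], wins = k[, 2], trials = trials))
```

The `trials` column shows where the traffic went; with these made-up probabilities most of it should end up on the alternative with the highest win rate.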
  • 29. The Basic Idea
      • We can encode a distribution by sampling
      • Sampling allows unification of exploration and exploitation
      • Can be extended to more general response models
  • 30. The Original Problem [Figure: the mock ad “Bogus Dog Food is the Best! Now available in handy 1 ton bags! Buy 5!” with its design elements labeled x1, x2 and x3]
  • 31. Response Function
      $p(\mathrm{win}) = w\left(\sum_i \theta_i x_i\right)$
      [Figure: plot of y (0 to 1) against x (-6 to 6)]
  • 32. Generalized Banditry
      • Suppose we have an infinite number of bandits
          – suppose they are each labeled by two real numbers x and y in [0,1]
          – also that expected payoff is a parameterized function of x and y: $E[z] = f(x, y \mid \theta)$
          – now assume a distribution for θ that we can learn online
      • Selection works by sampling θ, then computing f
      • Learning works by propagating updates back to θ
          – If f is linear, this is very easy
          – For special other kinds of f it isn’t too hard
      • Don’t just have to have two labels, could have labels and context
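For the linear case, selection and learning can be sketched with a Gaussian posterior over θ. Everything specific here is an assumption made for illustration: the feature map `features`, the value of `true_theta`, the N(0, I) prior, unit noise variance, and a finite grid standing in for the infinite family of bandits.

```r
set.seed(7)
# Hypothetical linear payoff model: E[z] = theta[1] + theta[2]*x + theta[3]*y.
features <- function(x, y) c(1, x, y)
true_theta <- c(0.1, 0.5, -0.3)          # unknown to the learner

# Candidate bandits: a grid of (x, y) labels in [0, 1].
grid <- expand.grid(x = seq(0, 1, by = 0.1), y = seq(0, 1, by = 0.1))
phi  <- t(mapply(features, grid$x, grid$y))   # one feature row per candidate

# Gaussian posterior over theta, stored as precision matrix A and vector b
# (prior N(0, I) and unit noise variance are both assumptions).
A <- diag(3)
b <- rep(0, 3)

for (step in 1:2000) {
  R     <- chol(A)                                  # A = t(R) %*% R
  mu    <- backsolve(R, forwardsolve(t(R), b))      # posterior mean
  theta <- mu + backsolve(R, rnorm(3))              # one posterior sample
  i     <- which.max(phi %*% theta)                 # selection: sample theta, compute f
  z     <- sum(phi[i, ] * true_theta) + rnorm(1)    # observed noisy payoff
  A     <- A + outer(phi[i, ], phi[i, ])            # learning: update the posterior
  b     <- b + phi[i, ] * z
}

best <- grid[which.max(phi %*% solve(A, b)), ]
print(best)
```

Because f is linear, the update is just two running sums, which is the sense in which the linear case is "very easy".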
  • 33. Context Variables [Figure: the same mock ad with design elements x1, x2 and x3, plus context variables user.geo, env.time, env.day_of_week and env.weekend]
  • 34. Caveats
      • Original Bayesian Bandit only requires real-time
      • Generalized Bandit may require access to long history for learning
          – Pseudo online learning may be easier than true online
      • Bandit variables can include content, time of day, day of week
      • Context variables can include user id, user features
      • Bandit × context variables provide the real power
  • 35. You can do this yourself!
  • 36. Super-fast k-means Clustering
  • 38. What is Quality?
      • Robust clustering not a goal
          – we don’t care if the same clustering is replicated
      • Generalization is critical
      • Agreement to “gold standard” is a non-issue
  • 39. An Example
  • 40. An Example
  • 41. Diagonalized Cluster Proximity
  • 42. Clusters as Distribution Surrogate
  • 43. Clusters as Distribution Surrogate
  • 45. For Example
      $\Delta_4^2(X) > \frac{1}{\sigma^2}\,\Delta_5^2(X)$
      • Grouping these two clusters seriously hurts squared distance
  • 47. Typical k-means Failure
      • Selecting two seeds here cannot be fixed with Lloyd’s algorithm
      • Result is that these two clusters get glued together
  • 48. Ball k-means
      • Provably better for highly clusterable data
      • Tries to find initial centroids in the “core” of each real cluster
      • Avoids outliers in centroid computation
            initialize centroids randomly with distance maximizing tendency
            for each of a very few iterations:
              for each data point:
                assign point to nearest cluster
              recompute centroids using only points much closer than closest cluster
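The pseudocode above can be sketched in R as follows. This is not Mahout's implementation; it is a small illustration under assumptions: seeding uses k-means++ style sampling proportional to squared distance as one way to get a "distance maximizing tendency", and "much closer than closest cluster" is read as "within `trim` times the distance to the nearest other centroid". The names `sq_dist`, `ball_kmeans`, `iterations` and `trim` are hypothetical.

```r
set.seed(3)
# Squared Euclidean distance from every row of x to every row of centers.
sq_dist <- function(x, centers) {
  outer(rowSums(x^2), rowSums(centers^2), "+") - 2 * x %*% t(centers)
}

ball_kmeans <- function(x, k, iterations = 3, trim = 0.5) {
  # Seeding: each new seed is drawn with probability proportional to its
  # squared distance from the seeds chosen so far (an assumed choice).
  seeds <- sample(nrow(x), 1)
  while (length(seeds) < k) {
    d2 <- apply(sq_dist(x, x[seeds, , drop = FALSE]), 1, min)
    seeds <- c(seeds, sample(nrow(x), 1, prob = pmax(d2, 0)))
  }
  centers <- x[seeds, , drop = FALSE]

  for (iter in seq_len(iterations)) {
    d2 <- pmax(sq_dist(x, centers), 0)
    nearest <- max.col(-d2, ties.method = "first")  # assign to nearest centroid
    own_d2 <- d2[cbind(seq_len(nrow(x)), nearest)]
    # Squared distance from each centroid to its closest other centroid.
    cc <- pmax(sq_dist(centers, centers), 0)
    diag(cc) <- Inf
    radius2 <- trim^2 * apply(cc, 1, min)
    for (j in seq_len(k)) {
      # Only points well inside the ball around centroid j move it,
      # so outliers and boundary points are ignored.
      core <- nearest == j & own_d2 < radius2[j]
      if (any(core)) centers[j, ] <- colMeans(x[core, , drop = FALSE])
    }
  }
  centers
}

# Three well-separated 2-D blobs as hypothetical test data.
x <- rbind(matrix(rnorm(200, mean = 0,  sd = 0.1), ncol = 2),
           matrix(rnorm(200, mean = 5,  sd = 0.1), ncol = 2),
           matrix(rnorm(200, mean = 10, sd = 0.1), ncol = 2))
centers <- ball_kmeans(x, k = 3)
sorted <- centers[order(centers[, 1]), ]
print(round(sorted, 2))
```

The trimmed update is what distinguishes this from plain Lloyd iterations: a centroid is recomputed only from its core points, so a few stray points cannot drag it away.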
  • 49. Still Not a Win
      • Ball k-means is nearly guaranteed to work with k = 2
      • Probability of successful seeding drops exponentially with k
      • Alternative strategy has high probability of success, but takes O(nkd + k^3 d) time
  • 50. Still Not a Win
      • Ball k-means is nearly guaranteed to work with k = 2
      • Probability of successful seeding drops exponentially with k
      • Alternative strategy has high probability of success, but takes O(nkd + k^3 d) time
      • But for big data, k gets large
  • 51. Surrogate Method
      • Start with sloppy clustering into lots of clusters: κ = k log n clusters
      • Use this sketch as a weighted surrogate for the data
      • Results are provably good for highly clusterable data
  • 52. Algorithm Costs
      • Surrogate methods
          – fast, sloppy single pass clustering with κ = k log n
          – fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
          – fast, in-memory, high-quality clustering of κ weighted centroids
              O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
              O(κ d log k) or O(d log κ log k) for larger k, looser quality
          – result is k high-quality centroids
              • Even the sloppy surrogate may suffice
  • 53. Algorithm Costs
      • Surrogate methods
          – fast, sloppy single pass clustering with κ = k log n
          – fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
          – fast, in-memory, high-quality clustering of κ weighted centroids
              O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
              O(κ d log k) or O(d log k (log k + log log n)) for larger k, looser quality
          – result is k high-quality centroids
              • For many purposes, even the sloppy surrogate may suffice
  • 54. Algorithm Costs
      • How much faster for the sketch phase?
          – take k = 2000, d = 10, n = 100,000
          – k d log n = 2000 x 10 x 26 ≈ 500,000
          – d (log k + log log n) = 10 (11 + 5) = 160
          – 3,000 times faster is a bona fide big deal
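The speedup estimate is easy to recompute. The sketch below plugs in the slide's own rounded logarithms (taken here to be base 2, which is an assumption) and then cross-checks those roundings against the stated k and n. The variable names are chosen for this example.

```r
# Values as stated on the slide, with the logs rounded as the slide rounds them.
k <- 2000; d <- 10
log_n <- 26; log_k <- 11; log_log_n <- 5

brute_force <- k * d * log_n               # compare against all k log n sketch centroids
approximate <- d * (log_k + log_log_n)     # approximate nearest-centroid search
speedup     <- brute_force / approximate
print(c(brute_force = brute_force, approximate = approximate, speedup = speedup))

# Cross-check of the rounded logs: log2(2000) is about 11, while a base-2 log
# of 26 corresponds to n of roughly 6.7e7; log2(1e5) is only about 16.6.
print(c(log2_k = log2(k), n_for_log2_26 = 2^26, log2_1e5 = log2(1e5)))
```

Whichever n was intended, the ratio stays in the low thousands, so the "about 3,000 times faster" conclusion is not sensitive to the rounding.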
  • 56. How It Works
      • For each point
          – Find approximately nearest centroid (distance = d)
          – If (d > threshold) new centroid
          – Else if (u < d/threshold) new cluster, where u is a uniform random number in [0, 1]
          – Else add to nearest centroid
      • If number of centroids > κ ≈ C log N
          – Recursively cluster centroids with higher threshold
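A single-pass sketch along these lines can be written in a few dozen lines of R. This is an illustration, not Mahout's streaming k-means: it uses an exact nearest-centroid search instead of an approximate one, creates a new centroid with probability d/threshold (which also covers the d > threshold case), and collapses by re-clustering the weighted centroids with a threshold grown by a hypothetical factor `growth`. The names `add_point` and `streaming_sketch` are assumptions.

```r
set.seed(11)
# Fold one weighted point into the sketch: either start a new centroid
# (probability dist / threshold) or merge into the nearest existing one.
add_point <- function(sketch, p, w, threshold) {
  d2 <- colSums((t(sketch$centers) - p)^2)
  j  <- which.min(d2)
  if (runif(1) < sqrt(d2[j]) / threshold) {
    sketch$centers <- rbind(sketch$centers, p, deparse.level = 0)
    sketch$weights <- c(sketch$weights, w)
  } else {
    total <- sketch$weights[j] + w
    sketch$centers[j, ] <- (sketch$weights[j] * sketch$centers[j, ] + w * p) / total
    sketch$weights[j] <- total
  }
  sketch
}

streaming_sketch <- function(x, max_centroids, threshold, growth = 1.5) {
  sketch <- list(centers = x[1, , drop = FALSE], weights = 1)
  for (i in seq_len(nrow(x))[-1]) {
    sketch <- add_point(sketch, x[i, ], 1, threshold)
    # Too many centroids: raise the threshold and cluster the centroids themselves.
    while (nrow(sketch$centers) > max_centroids) {
      threshold <- threshold * growth
      old <- sketch
      sketch <- list(centers = old$centers[1, , drop = FALSE],
                     weights = old$weights[1])
      for (r in seq_len(nrow(old$centers))[-1]) {
        sketch <- add_point(sketch, old$centers[r, ], old$weights[r], threshold)
      }
    }
  }
  c(sketch, threshold = threshold)
}

# Hypothetical data: three 2-D blobs, shuffled to look like a stream.
x <- rbind(matrix(rnorm(600, mean = 0), ncol = 2),
           matrix(rnorm(600, mean = 6), ncol = 2),
           matrix(rnorm(600, mean = 12), ncol = 2))
x <- x[sample(nrow(x)), ]
res <- streaming_sketch(x, max_centroids = 20, threshold = 0.1)
print(nrow(res$centers))
print(res$threshold)
```

Every merge preserves total weight and the weighted mean, so the resulting weighted centroids can stand in for the original points in a later in-memory clustering step.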
  • 58. But Wait, …
      • Finding nearest centroid is inner loop
      • This could take O(d κ) per point and κ can be big
      • Happily, approximate nearest centroid works fine
  • 59. Projection Search: total ordering!
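The idea behind projection search is that projecting every centroid onto a single direction yields a total ordering, so candidates near a query can be found by binary search. The sketch below is one simple reading of that idea, not Mahout's implementation: it uses a handful of random directions, looks at a small window of neighbors in each ordering, and then picks the best candidate by true distance. The names `build_index`, `approx_nearest`, `n_proj` and `window` are hypothetical.

```r
set.seed(5)
# Project all centroids onto a few random unit directions and sort each projection.
build_index <- function(centers, n_proj = 4) {
  dirs <- matrix(rnorm(n_proj * ncol(centers)), nrow = ncol(centers))
  dirs <- sweep(dirs, 2, sqrt(colSums(dirs^2)), "/")
  proj <- centers %*% dirs             # one column of scalar keys per direction
  ord  <- apply(proj, 2, order)        # total ordering along each direction
  list(centers = centers, dirs = dirs, proj = proj, ord = ord)
}

approx_nearest <- function(index, q, window = 3) {
  m <- nrow(index$centers)
  candidates <- integer(0)
  for (j in seq_len(ncol(index$dirs))) {
    sorted <- index$proj[index$ord[, j], j]
    pos  <- findInterval(sum(q * index$dirs[, j]), sorted)   # binary search
    near <- max(1, pos - window + 1):min(m, pos + window)
    candidates <- c(candidates, index$ord[near, j])
  }
  candidates <- unique(candidates)
  # Only the few candidates are checked with the full d-dimensional distance.
  d2 <- colSums((t(index$centers[candidates, , drop = FALSE]) - q)^2)
  candidates[which.min(d2)]
}

# Hypothetical centroids: 200 random points in 10 dimensions.
centers <- matrix(rnorm(200 * 10), ncol = 10)
index <- build_index(centers)
hit <- approx_nearest(index, centers[17, ])
print(hit)
```

The search touches at most `n_proj * 2 * window` candidates per query instead of all centroids; the answer is approximate, which the previous slide argues is good enough for the sketch phase.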
  • 60. LSH Bit-match Versus Cosine [Figure: plot with Y axis from -1 to 1 and X axis from 0 to 64]
  • 62. Parallel Speedup? [Figure: time per point (μs) versus number of threads for the non-threaded and threaded versions, with a perfect-scaling reference line]
  • 63. Quality
      • Ball k-means implementation appears significantly better than simple k-means
      • Streaming k-means + ball k-means appears to be about as good as ball k-means alone
      • All evaluations on 20 newsgroups with held-out data
      • Figure of merit is mean and median squared distance to nearest cluster
  • 64. Contact Me!
      • We’re hiring at MapR in US and Europe
      • MapR software available for research use
      • Get the code as part of Mahout trunk (or 0.8 very soon)
      • Contact me at tdunning@maprtech.com or @ted_dunning
      • Share news with @apachemahout

Editor's Notes

  1. The basic idea here is that I have colored slides to be presented by you in blue. You should substitute and reword those slides as you like. In a few places, I imagined that we would have fast back and forth, as in the introduction or final slide, where we can each say we are hiring in turn.
     The overall thrust of the presentation is for you to make these points:
       – Amex does lots of modeling
       – it is expensive
       – having a way to quickly test models and new variables would be awesome
       – so we worked on a new project with MapR
     My part will say the following:
       – knn basic pictorial motivation (could move to you if you like)
       – describe knn quality metric of overlap
       – show how bad metric breaks knn (optional)
       – quick description of LSH and projection search
       – picture of why k-means search is cool
       – motivate k-means speed as tool for k-means search
       – describe single pass k-means algorithm
       – describe basic data structures
       – show parallel speedup
     Our summary should state that we have achieved:
       – super-fast k-means clustering
       – initial version of super-fast knn search with good overlap