Which Algorithms Really Matter

Ted Dunning, Software Engineer at MapR Technologies
Which Algorithms Really Matter?

©MapR Technologies 2013

Me, Us


Ted Dunning, Chief Application Architect, MapR
Committer and PMC member: Mahout, ZooKeeper, Drill
Bought the beer at the first HUG

MapR
Distributes more open source components for Hadoop
Adds major technology for performance, HA, and industry-standard APIs

Info
Hash tag: #mapr
See also: @ApacheMahout, @ApacheDrill, @ted_dunning, and @mapR

Topic For Today


What is important? What is not?



Why?



What is the difference from academic research?



Some examples

What is Important?

Deployable
– Clever prototypes don’t count if they can’t be standardized

Robust
– Mishandling is common

Transparent
– Will degradation be obvious?

Skillset and mindset matched?
– How long will your fancy data scientist enjoy doing standard ops tasks?

Proportionate
– Where is the highest value per minute of effort?

Academic Goals vs Pragmatics


Academic goals
– Reproducible
– Isolate theoretically important aspects
– Work on novel problems

Pragmatics
– Highest net value
– Available data is constantly changing
– Diligence and consistency have larger impact than cleverness
– Many systems feed themselves; exploration and exploitation are both important
– Engineering constraints on budget and schedule

Example 1:
Making Recommendations Better

Recommendation Advances


What are the most important algorithmic advances in
recommendations over the last 10 years?



Cooccurrence analysis?



Matrix completion via factorization?



Latent factor log-linear models?



Temporal dynamics?

The Winner – None of the Above


What are the most important algorithmic advances in
recommendations over the last 10 years?

1. Result dithering
2. Anti-flood

The Real Issues


Exploration



Diversity



Speed



Not the last fraction of a percent

Result Dithering


Dithering is used to re-order recommendation results
– Re-ordering is done randomly



Dithering is guaranteed to make off-line performance worse



Dithering also has a near perfect record of making actual
performance much better

“Made more difference than any other change”

Simple Dithering Algorithm


Generate synthetic score from log rank plus Gaussian

$s = \log r + N(0, \epsilon)$

Pick noise scale to provide desired level of mixing

$\Delta r \propto r \exp \epsilon$

Typically

$\epsilon \in [0.4, 0.8]$

Oh… use floor(t/T) as seed
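
The recipe on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `dither`, the parameter names, and the 600-second default period are assumptions, and ε is treated here as the standard deviation of the Gaussian noise.

```python
import math
import random


def dither(ranked_items, epsilon=0.5, t=0.0, period=600.0):
    """Re-order results by the synthetic score s = log(rank) + N(0, epsilon)."""
    # Seed with floor(t / T) so the ordering stays stable within one time window.
    rng = random.Random(math.floor(t / period))
    scored = []
    for rank, item in enumerate(ranked_items, start=1):
        # Assumption: epsilon is the standard deviation of the noise.
        score = math.log(rank) + rng.gauss(0.0, epsilon)
        scored.append((score, rank, item))
    scored.sort()  # lowest synthetic score first; rank breaks exact ties
    return [item for _, _, item in scored]


results = [f"item{i}" for i in range(1, 21)]
print(dither(results, epsilon=0.5, t=1200.0)[:8])
```

Because the noise is added to the log of the rank, the top few results move only a little while deeper results can jump many positions, which is how items from the second page occasionally get exposure.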

Example … ε = 0.5
1 1 1 1 1 1 1 2 4 2 3 2
2 2 4 2 6 2 2 1 1 1 1 1
6 3 3 4 2 3 3 3 2 5 5 3
5 8 2 3 3 5 4 5 7 3 4 4
3 5 6 15 4 24 6 7 3 4 2 7 17
4 7 7 7 16 7 12 6 9 7 7 12
13 6 11 13 9 17 5 4 8 13 8 17
16 34 10 19 5 13 14 17 5 6 6 16

Example … ε = log 2 = 0.69
1 1 1 1 1 1 1 2 2 3 11 1
2 8 3 2 5 2 3 4 3 4 1 8
8 14 8 10 33 7 5 11 1 1 2 7
3 15 2 7 15 3 23 8 4 2 4 3
9 3 10 3 2 5 9 3 6 10 5 22 18
15 2 5 8 9 4 7 1 7 11 7 11
7 22 7 6 11 19 4 44 8 15 3 2
6 10 4 14 29 6 2 9 33 14 14 33

Exploring The Second Page

Lesson 1:
Exploration is good

Example 2:
Bayesian Bandits

Bayesian Bandits


Based on Thompson sampling



Very general sequential test



Near optimal regret



Trades off exploration and exploitation



Possibly the best known solution for exploration/exploitation



Incredibly simple

Thompson Sampling


Select each shell according to the probability that it is the best



Probability that it is the best can be computed using posterior

$$P(i \text{ is best}) = \int I\left[ E[r_i \mid \theta] = \max_j E[r_j \mid \theta] \right] P(\theta \mid D) \, d\theta$$


But I promised a simple answer

Thompson Sampling – Take 2


Sample θ

$\theta \sim P(\theta \mid D)$

Pick i to maximize reward

$i = \arg\max_j E[r_j \mid \theta]$



Record result from using i
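
The three steps above fit in a short loop. The sketch below is an illustration under stated assumptions rather than the setup used in the talk: it assumes Bernoulli rewards with a Beta(1, 1) prior on each arm (the convergence plot in this deck uses a Gamma-Normal model instead), and the function name `thompson_bernoulli` and the reward rates are made up.

```python
import random


def thompson_bernoulli(true_rates, rounds=2000, seed=42):
    """Run Thompson sampling on simulated Bernoulli arms; return pulls per arm."""
    rng = random.Random(seed)
    k = len(true_rates)
    wins = [0] * k
    losses = [0] * k
    pulls = [0] * k
    for _ in range(rounds):
        # Sample theta ~ P(theta | D): one Beta(1 + wins, 1 + losses) draw per arm.
        theta = [rng.betavariate(1 + wins[j], 1 + losses[j]) for j in range(k)]
        # Pick i = argmax_j E[r_j | theta]; for Bernoulli rewards this is theta_j.
        i = max(range(k), key=lambda j: theta[j])
        # Record the result from using i (simulated here from the true rate).
        reward = 1 if rng.random() < true_rates[i] else 0
        wins[i] += reward
        losses[i] += 1 - reward
        pulls[i] += 1
    return pulls


pulls = thompson_bernoulli([0.2, 0.5, 0.8])
print(pulls)
```

Arms with little data have wide posteriors, so they are still sampled as the best now and then; as evidence accumulates, the draws concentrate and the best arm takes nearly all of the traffic.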

Fast Convergence

[Figure: regret (0 to 0.12) versus n (0 to 1100), comparing ε-greedy with ε = 0.05 against a Bayesian Bandit with Gamma-Normal]

Thompson Sampling on Ads

An Empirical Evaluation of Thompson Sampling - Chapelle and Li, 2011

Bayesian Bandits versus Result Dithering


Many useful systems are difficult to frame in fully Bayesian form



Thompson sampling cannot be applied without posterior sampling



Can still do useful exploration with dithering



But better to use Thompson sampling if possible

Lesson 2:
Exploration is pretty
easy to do and pays
big benefits.

Example 3:
On-line Clustering

The Problem


K-means clustering is useful for feature extraction or compression



At scale and at high dimension, the desirable number of clusters
increases



Very large number of clusters may require more passes through
the data



Super-linear scaling is generally infeasible

The Solution


Sketch-based algorithms produce a sketch of the data



Streaming k-means uses adaptive dp-means to produce this sketch
in the form of many weighted centroids which approximate the
original distribution



The size of the sketch grows very slowly with increasing data size



Many operations such as clustering are well behaved on sketches

Fast and Accurate k-means For Large Datasets. Michael Shindler, Alex Wong, Adam Meyerson.
Revisiting k-means: New Algorithms via Bayesian Nonparametrics. Brian Kulis, Michael Jordan.
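
A one-pass sketch of this kind can be illustrated as follows. This is a deliberately simplified stand-in, not the algorithm from either paper: it uses a deterministic DP-means-style rule (start a new centroid when a point is farther than a cutoff from every existing centroid) and raises the cutoff whenever the sketch grows past a size limit. The function name and all parameter values are assumptions.

```python
import math
import random


def streaming_sketch(points, max_centroids=30, threshold=0.5, growth=1.5):
    """Build a sketch of weighted centroids in one pass over the data."""
    centroids = []  # each entry is [weight, coordinates]

    def add(point, weight, cutoff):
        if centroids:
            nearest = min(centroids, key=lambda c: math.dist(c[1], point))
            if math.dist(nearest[1], point) <= cutoff:
                # Merge into the nearest centroid as a weighted mean.
                total = nearest[0] + weight
                nearest[1] = [(nearest[0] * a + weight * b) / total
                              for a, b in zip(nearest[1], point)]
                nearest[0] = total
                return
        centroids.append([weight, list(point)])

    for p in points:
        add(p, 1.0, threshold)
        while len(centroids) > max_centroids:
            # Too many centroids: relax the cutoff and re-cluster the sketch itself.
            threshold *= growth
            old = centroids[:]
            centroids.clear()
            for w, c in old:
                add(c, w, threshold)
    return centroids


rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
data += [(rng.gauss(20, 1), rng.gauss(20, 1)) for _ in range(500)]
sketch = streaming_sketch(data)
print(len(sketch), sum(w for w, _ in sketch))
```

The weights always sum to the number of input points, so a weighted k-means run on the sketch sees the same total mass as the original data while touching far fewer points.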

An Example

The Cluster Proximity Features


Every point can be described by the nearest cluster
– 4.3 bits per point in this case
– Significant error that can be decreased (to a point) by increasing number of clusters

Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)
– Error is negligible
– Unwinds the data into a simple representation

Or we can increase the number of clusters (n-fold increase adds log n bits per point, decreases error by sqrt(n))
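
The bit counts above follow from the cost of naming one cluster out of k, which is log2(k) bits. The sketch below is a worked illustration with assumed values: the figure of 20 clusters is inferred from 4.3 ≈ log2(20) and is not stated on the slide, and the helper names and toy centroids are hypothetical.

```python
import math


def bits_for_cluster_id(num_clusters):
    """Bits needed to name one cluster out of num_clusters."""
    return math.log2(num_clusters)


def two_nearest(point, centroids):
    """Return indices of the two nearest centroids and the distance to each."""
    order = sorted(range(len(centroids)),
                   key=lambda i: math.dist(point, centroids[i]))
    first, second = order[0], order[1]
    return (first, second,
            math.dist(point, centroids[first]),
            math.dist(point, centroids[second]))


print(round(bits_for_cluster_id(20), 1))                            # 4.3
print(round(bits_for_cluster_id(40) - bits_for_cluster_id(20), 1))  # 1.0

centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
print(two_nearest((1.0, 0.0), centroids))                           # (0, 1, 1.0, 3.0)
```

Doubling the number of clusters costs one extra bit per point, which is the "n-fold increase adds log n bits" trade-off in concrete terms.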

Diagonalized Cluster Proximity

Lots of Clusters Are Fine

Typical k-means Failure

Selecting two seeds here cannot be fixed with Lloyd's algorithm.
The result is that these two clusters get glued together.

Streaming k-means Ideas


By using a sketch with lots (k log N) of centroids, we avoid
pathological cases



We still get a very good result if the sketch is created
– in one pass
– with approximate search



In fact, adaptive dp-means works just fine



In the end, the sketch can be used for clustering or …

Lesson 3:
Sketches make big
data small.

Example 4:
Search Abuse

Recommendations

Alice got an apple and a puppy

Bob got an apple

Charles got a bicycle

What else would Bob like?

Log Files
[Figure: log file entries, each pairing a user (Alice, Bob, or Charles) with an item]

History Matrix: Users by Items

[Figure: users-by-items history matrix with rows for Alice, Bob, and Charles; a check mark records each user-item interaction]

Co-occurrence Matrix: Items by Items
How do you tell which co-occurrences are useful?

[Figure: items-by-items co-occurrence matrix of counts]

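
Going from a history matrix to co-occurrence counts can be sketched as below. The data is the three-user story from the Recommendations slide (Alice, Bob, Charles), not the larger matrix shown in the figure, and the function name `cooccurrence` is an assumption.

```python
from collections import defaultdict
from itertools import combinations

# User history taken from the Recommendations slide.
history = {
    "Alice": {"apple", "puppy"},
    "Bob": {"apple"},
    "Charles": {"bicycle"},
}


def cooccurrence(history):
    """Count, for each ordered item pair, how many users have both items."""
    counts = defaultdict(int)
    for items in history.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return dict(counts)


print(cooccurrence(history))
# {('apple', 'puppy'): 1, ('puppy', 'apple'): 1}
```

Raw counts like these are dominated by popular items, which is why the question on this slide matters: a count alone does not say whether a pairing is interesting.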
Co-occurrence Binary Matrix

[Figure: binary co-occurrence matrix]

Indicator Matrix: Anomalous Co-Occurrence
Result: The marked row will be added to the indicator
field in the item document…

[Figure: indicator matrix with the anomalous co-occurrences marked]

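
The slides do not name the test used to decide which co-occurrences are anomalous. One common choice, assumed here purely for illustration, is the log-likelihood ratio (G²) statistic on a 2 x 2 contingency table; the function name `llr` and the cell naming are assumptions.

```python
import math


def llr(k11, k12, k21, k22):
    """G^2 statistic for a 2 x 2 table of co-occurrence counts.

    k11: users with both items      k12: users with item A only
    k21: users with item B only     k22: users with neither item
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g2 = 0.0
    for k, r, c in cells:
        if k > 0:  # empty cells contribute nothing
            g2 += k * math.log(k * total / (rows[r] * cols[c]))
    return 2.0 * g2


print(llr(10, 10, 10, 10))  # independent items give a score of zero
print(llr(10, 0, 0, 10))    # items that always appear together score high
```

Pairs whose score exceeds a chosen threshold would be marked in the indicator matrix; the threshold itself is a tuning decision not covered here.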
Indicator Matrix
That one row from the indicator matrix becomes the indicator field in the Solr
document used to deploy the recommendation engine.

id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
indicators: (t1)

Note: data for the indicator field is added directly to the metadata for a document in the
Solr index. You don’t need to create a separate index for the indicators.

Internals of the Recommender Engine

Looking Inside LucidWorks
Real-time recommendation query and results: Evaluation

What to recommend if a new user listened to 2122: Fats Domino and 303: Beatles?
Recommendation is “1710: Chuck Berry”

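
The query on this slide amounts to searching the indicator field with the user's recent history. The sketch below imitates that with an in-memory dictionary instead of a real search engine and scores by simple overlap count, where a search engine would apply its own relevance ranking. The three artist IDs come from the slide; every indicator list, the extra catalog entry, and the function name `recommend` are hypothetical.

```python
from collections import Counter

# Hypothetical item documents; the indicator lists are made up for illustration.
catalog = {
    "1710": {"title": "Chuck Berry", "indicators": ["2122", "303", "98"]},
    "2122": {"title": "Fats Domino", "indicators": ["1710", "55"]},
    "303": {"title": "Beatles", "indicators": ["1710"]},
    "9999": {"title": "Unrelated Artist", "indicators": ["12"]},
}


def recommend(history, catalog, limit=3):
    """Rank items by how many of their indicators appear in the user's history."""
    scores = Counter()
    for item_id, doc in catalog.items():
        if item_id in history:
            continue  # do not recommend what the user already has
        overlap = len(set(history) & set(doc["indicators"]))
        if overlap:
            scores[item_id] = overlap
    return [(i, catalog[i]["title"]) for i, _ in scores.most_common(limit)]


print(recommend(["2122", "303"], catalog))  # [('1710', 'Chuck Berry')]
```

Because the indicators live in an ordinary document field, the same engine can combine them with text search, filters, and any other query features in a single request.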
Real-life example

Lesson 4:
Recursive search abuse pays
Search can implement recs
Which can implement search

Summary

1 of 56

Recommended

Hog by
HogHog
HogAnirudh Kanneganti
3.4K views9 slides
Google net by
Google netGoogle net
Google netBrian Kim
4.2K views57 slides
Formation traitement d_images by
Formation traitement d_imagesFormation traitement d_images
Formation traitement d_imagesCynapsys It Hotspot
4.9K views25 slides
TeraStream for ETL by
TeraStream for ETLTeraStream for ETL
TeraStream for ETL치민 최
8.6K views33 slides
Anomaly detection by
Anomaly detectionAnomaly detection
Anomaly detection철 김
3.3K views40 slides
Credit Card Fraud Detection Tutorial by
Credit Card Fraud Detection TutorialCredit Card Fraud Detection Tutorial
Credit Card Fraud Detection TutorialKNIMESlides
4.1K views19 slides

More Related Content

What's hot

Transformers in Vision: From Zero to Hero by
Transformers in Vision: From Zero to HeroTransformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to HeroBill Liu
1.8K views64 slides
Vision et traitement d'images by
Vision et traitement d'imagesVision et traitement d'images
Vision et traitement d'imagesWided Miled
339 views197 slides
Image Segmentation: Approaches and Challenges by
Image Segmentation: Approaches and ChallengesImage Segmentation: Approaches and Challenges
Image Segmentation: Approaches and ChallengesApache MXNet
237 views19 slides
PAC-Bayesian Bound for Deep Learning by
PAC-Bayesian Bound for Deep LearningPAC-Bayesian Bound for Deep Learning
PAC-Bayesian Bound for Deep LearningMark Chang
539 views21 slides
Information visualization: interaction by
Information visualization: interactionInformation visualization: interaction
Information visualization: interactionKatrien Verbert
2.6K views116 slides
Curse of Dimensionality and Big Data by
Curse of Dimensionality and Big DataCurse of Dimensionality and Big Data
Curse of Dimensionality and Big DataStephane Marchand-Maillet
2.9K views120 slides

What's hot(20)

Transformers in Vision: From Zero to Hero by Bill Liu
Transformers in Vision: From Zero to HeroTransformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to Hero
Bill Liu1.8K views
Vision et traitement d'images by Wided Miled
Vision et traitement d'imagesVision et traitement d'images
Vision et traitement d'images
Wided Miled339 views
Image Segmentation: Approaches and Challenges by Apache MXNet
Image Segmentation: Approaches and ChallengesImage Segmentation: Approaches and Challenges
Image Segmentation: Approaches and Challenges
Apache MXNet237 views
PAC-Bayesian Bound for Deep Learning by Mark Chang
PAC-Bayesian Bound for Deep LearningPAC-Bayesian Bound for Deep Learning
PAC-Bayesian Bound for Deep Learning
Mark Chang539 views
Information visualization: interaction by Katrien Verbert
Information visualization: interactionInformation visualization: interaction
Information visualization: interaction
Katrien Verbert2.6K views
Practical Machine Learning and Rails Part1 by ryanstout
Practical Machine Learning and Rails Part1Practical Machine Learning and Rails Part1
Practical Machine Learning and Rails Part1
ryanstout4K views
Final thesis presentation by Pawan Singh
Final thesis presentationFinal thesis presentation
Final thesis presentation
Pawan Singh857 views
Machine learning with graph by Ding Li
Machine learning with graphMachine learning with graph
Machine learning with graph
Ding Li192 views
Convolutional neural network by Ferdous ahmed
Convolutional neural networkConvolutional neural network
Convolutional neural network
Ferdous ahmed554 views
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef... by PyData
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
PyData11.6K views
PhD Thesis Proposal by Ziqiang Feng
PhD Thesis Proposal PhD Thesis Proposal
PhD Thesis Proposal
Ziqiang Feng475 views
Anomaly detection - TIBCO Data Science Central by Michael O'Connell
Anomaly detection - TIBCO Data Science CentralAnomaly detection - TIBCO Data Science Central
Anomaly detection - TIBCO Data Science Central
Michael O'Connell365 views
Deep-Learning Based Stereo Super-Resolution by NAVER Engineering
Deep-Learning Based Stereo Super-ResolutionDeep-Learning Based Stereo Super-Resolution
Deep-Learning Based Stereo Super-Resolution
NAVER Engineering1.5K views
Single Image Super-Resolution from Transformed Self-Exemplars (CVPR 2015) by Jia-Bin Huang
Single Image Super-Resolution from Transformed Self-Exemplars (CVPR 2015)Single Image Super-Resolution from Transformed Self-Exemplars (CVPR 2015)
Single Image Super-Resolution from Transformed Self-Exemplars (CVPR 2015)
Jia-Bin Huang4.3K views
Multi objective optimization and Benchmark functions result by Piyush Agarwal
Multi objective optimization and Benchmark functions resultMulti objective optimization and Benchmark functions result
Multi objective optimization and Benchmark functions result
Piyush Agarwal2.8K views
Object Detection Using R-CNN Deep Learning Framework by Nader Karimi
Object Detection Using R-CNN Deep Learning FrameworkObject Detection Using R-CNN Deep Learning Framework
Object Detection Using R-CNN Deep Learning Framework
Nader Karimi898 views

Similar to Which Algorithms Really Matter

How to Determine which Algorithms Really Matter by
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterDataWorks Summit
1.2K views56 slides
Mahout and Recommendations by
Mahout and RecommendationsMahout and Recommendations
Mahout and RecommendationsTed Dunning
1.8K views61 slides
DFW Big Data talk on Mahout Recommenders by
DFW Big Data talk on Mahout RecommendersDFW Big Data talk on Mahout Recommenders
DFW Big Data talk on Mahout RecommendersTed Dunning
1.6K views61 slides
How to tell which algorithms really matter by
How to tell which algorithms really matterHow to tell which algorithms really matter
How to tell which algorithms really matterDataWorks Summit
1.1K views54 slides
CMU Lecture on Hadoop Performance by
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceMapR Technologies
552 views44 slides
Predictive Analytics with Hadoop by
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with HadoopDataWorks Summit
15K views65 slides

Similar to Which Algorithms Really Matter(20)

How to Determine which Algorithms Really Matter by DataWorks Summit
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really Matter
DataWorks Summit1.2K views
Mahout and Recommendations by Ted Dunning
Mahout and RecommendationsMahout and Recommendations
Mahout and Recommendations
Ted Dunning1.8K views
DFW Big Data talk on Mahout Recommenders by Ted Dunning
DFW Big Data talk on Mahout RecommendersDFW Big Data talk on Mahout Recommenders
DFW Big Data talk on Mahout Recommenders
Ted Dunning1.6K views
How to tell which algorithms really matter by DataWorks Summit
How to tell which algorithms really matterHow to tell which algorithms really matter
How to tell which algorithms really matter
DataWorks Summit1.1K views
Whats Right and Wrong with Apache Mahout by Ted Dunning
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache Mahout
Ted Dunning6.1K views
What's Right and Wrong with Apache Mahout by MapR Technologies
What's Right and Wrong with Apache MahoutWhat's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache Mahout
MapR Technologies1.1K views
Goto amsterdam-2013-skinned by Ted Dunning
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinned
Ted Dunning971 views
Introduction to Mahout by Ted Dunning
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
Ted Dunning5.4K views
Introduction to Mahout given at Twin Cities HUG by MapR Technologies
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUG
Using Mahout and a Search Engine for Recommendation by Ted Dunning
Using Mahout and a Search Engine for RecommendationUsing Mahout and a Search Engine for Recommendation
Using Mahout and a Search Engine for Recommendation
Ted Dunning7.4K views
DataOps: An Agile Method for Data-Driven Organizations by Ellen Friedman
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven Organizations
Ellen Friedman2.3K views
Big Data LDN 2018: DATA OPERATIONS PROBLEMS CREATED BY DEEP LEARNING, AND HOW... by Matt Stubbs
Big Data LDN 2018: DATA OPERATIONS PROBLEMS CREATED BY DEEP LEARNING, AND HOW...Big Data LDN 2018: DATA OPERATIONS PROBLEMS CREATED BY DEEP LEARNING, AND HOW...
Big Data LDN 2018: DATA OPERATIONS PROBLEMS CREATED BY DEEP LEARNING, AND HOW...
Matt Stubbs129 views
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time by Ted Dunning
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Ted Dunning2.8K views
Ted Dunning, Chief Application Architect, MapR at MLconf SF by MLconf
Ted Dunning, Chief Application Architect, MapR at MLconf SFTed Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SF
MLconf3.1K views
Anomaly Detection: How to find what you didn’t know to look for by Ted Dunning
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
Ted Dunning766 views

More from Ted Dunning

Dunning - SIGMOD - Data Economy.pptx by
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxTed Dunning
19 views21 slides
How to Get Going with Kubernetes by
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with KubernetesTed Dunning
593 views80 slides
Progress for big data in Kubernetes by
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in KubernetesTed Dunning
473 views82 slides
Streaming Architecture including Rendezvous for Machine Learning by
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningTed Dunning
679 views83 slides
Machine Learning Logistics by
Machine Learning LogisticsMachine Learning Logistics
Machine Learning LogisticsTed Dunning
613 views52 slides
Tensor Abuse - how to reuse machine learning frameworks by
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTed Dunning
883 views24 slides

More from Ted Dunning(20)

Dunning - SIGMOD - Data Economy.pptx by Ted Dunning
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
Ted Dunning19 views
How to Get Going with Kubernetes by Ted Dunning
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
Ted Dunning593 views
Progress for big data in Kubernetes by Ted Dunning
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
Ted Dunning473 views
Streaming Architecture including Rendezvous for Machine Learning by Ted Dunning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
Ted Dunning679 views
Machine Learning Logistics by Ted Dunning
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
Ted Dunning613 views
Tensor Abuse - how to reuse machine learning frameworks by Ted Dunning
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
Ted Dunning883 views
Machine Learning logistics by Ted Dunning
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
Ted Dunning3.9K views
T digest-update by Ted Dunning
T digest-updateT digest-update
T digest-update
Ted Dunning1.4K views
Finding Changes in Real Data by Ted Dunning
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real Data
Ted Dunning803 views
Where is Data Going? - RMDC Keynote by Ted Dunning
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
Ted Dunning545 views
Real time-hadoop by Ted Dunning
Real time-hadoopReal time-hadoop
Real time-hadoop
Ted Dunning1.7K views
Cheap learning-dunning-9-18-2015 by Ted Dunning
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015
Ted Dunning1.8K views
Sharing Sensitive Data Securely by Ted Dunning
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data Securely
Ted Dunning1.8K views
How the Internet of Things is Turning the Internet Upside Down by Ted Dunning
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
Ted Dunning1.7K views
Apache Kylin - OLAP Cubes for SQL on Hadoop by Ted Dunning
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
Ted Dunning8.5K views
Dunning time-series-2015 by Ted Dunning
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
Ted Dunning1.1K views
Doing-the-impossible by Ted Dunning
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
Ted Dunning3.3K views
Anomaly Detection - New York Machine Learning by Ted Dunning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
Ted Dunning6.3K views
Cognitive computing with big data, high tech and low tech approaches by Ted Dunning
Cognitive computing with big data, high tech and low tech approachesCognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approaches
Ted Dunning2.6K views
Recommendation Techn by Ted Dunning
Recommendation TechnRecommendation Techn
Recommendation Techn
Ted Dunning1.6K views

Recently uploaded

HTTP headers that make your website go faster - devs.gent November 2023 by
HTTP headers that make your website go faster - devs.gent November 2023HTTP headers that make your website go faster - devs.gent November 2023
HTTP headers that make your website go faster - devs.gent November 2023Thijs Feryn
26 views151 slides
20231123_Camunda Meetup Vienna.pdf by
20231123_Camunda Meetup Vienna.pdf20231123_Camunda Meetup Vienna.pdf
20231123_Camunda Meetup Vienna.pdfPhactum Softwareentwicklung GmbH
45 views73 slides
SAP Automation Using Bar Code and FIORI.pdf by
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdfVirendra Rai, PMP
25 views38 slides
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors by
TouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective SensorsTouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective Sensors
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensorssugiuralab
23 views15 slides
Data Integrity for Banking and Financial Services by
Data Integrity for Banking and Financial ServicesData Integrity for Banking and Financial Services
Data Integrity for Banking and Financial ServicesPrecisely
29 views26 slides
Kyo - Functional Scala 2023.pdf by
Kyo - Functional Scala 2023.pdfKyo - Functional Scala 2023.pdf
Kyo - Functional Scala 2023.pdfFlavio W. Brasil
418 views92 slides

Recently uploaded(20)

HTTP headers that make your website go faster - devs.gent November 2023 by Thijs Feryn
HTTP headers that make your website go faster - devs.gent November 2023HTTP headers that make your website go faster - devs.gent November 2023
HTTP headers that make your website go faster - devs.gent November 2023
Thijs Feryn26 views
SAP Automation Using Bar Code and FIORI.pdf by Virendra Rai, PMP
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdf
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors by sugiuralab
TouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective SensorsTouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective Sensors
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors
sugiuralab23 views
Data Integrity for Banking and Financial Services by Precisely
Data Integrity for Banking and Financial ServicesData Integrity for Banking and Financial Services
Data Integrity for Banking and Financial Services
Precisely29 views
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... by Jasper Oosterveld
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
Special_edition_innovator_2023.pdf by WillDavies22
Special_edition_innovator_2023.pdfSpecial_edition_innovator_2023.pdf
Special_edition_innovator_2023.pdf
WillDavies2218 views
The Forbidden VPN Secrets.pdf by Mariam Shaba
The Forbidden VPN Secrets.pdfThe Forbidden VPN Secrets.pdf
The Forbidden VPN Secrets.pdf
Mariam Shaba20 views
Business Analyst Series 2023 - Week 3 Session 5 by DianaGray10
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5
DianaGray10345 views

Which Algorithms Really Matter

  • 1. Which Algorithms Really Matter? ©MapR Technologies 2013 1
  • 2. Me, Us  Ted Dunning, Chief Application Architect, MapR Committer PMC member, Mahout, Zookeeper, Drill Bought the beer at the first HUG  MapR Distributes more open source components for Hadoop Adds major technology for performance, HA, industry standard API’s  Info Hash tag - #mapr See also - @ApacheMahout @ApacheDrill @ted_dunning and @mapR ©MapR Technologies 2013 2
  • 3. Topic For Today  What is important? What is not?  Why?  What is the difference from academic research?  Some examples ©MapR Technologies 2013 4
  • 4. What is Important?  Deployable  Robust  Transparent  Skillset and mindset matched?  Proportionate ©MapR Technologies 2013 5
  • 5. What is Important?  Deployable – Clever prototypes don’t count if they can’t be standardized  Robust  Transparent  Skillset and mindset matched?  Proportionate ©MapR Technologies 2013 6
  • 6. What is Important?  Deployable –  Robust –  Clever prototypes don’t count Mishandling is common Transparent – Will degradation be obvious?  Skillset and mindset matched?  Proportionate ©MapR Technologies 2013 7
  • 7. What is Important?  Deployable –  Robust –  Will degradation be obvious? Skillset and mindset matched? –  Mishandling is common Transparent –  Clever prototypes don’t count How long will your fancy data scientist enjoy doing standard ops tasks? Proportionate – Where is the highest value per minute of effort? ©MapR Technologies 2013 8
  • 8. Academic Goals vs Pragmatics  Academic goals – – –  Reproducible Isolate theoretically important aspects Work on novel problems Pragmatics – – – – – Highest net value Available data is constantly changing Diligence and consistency have larger impact than cleverness Many systems feed themselves, exploration and exploitation are both important Engineering constraints on budget and schedule ©MapR Technologies 2013 9
  • 9. Example 1: Making Recommendations Better ©MapR Technologies 2013 10
  • 10. Recommendation Advances  What are the most important algorithmic advances in recommendations over the last 10 years?  Cooccurrence analysis?  Matrix completion via factorization?  Latent factor log-linear models?  Temporal dynamics? ©MapR Technologies 2013 11
  • 11. The Winner – None of the Above  What are the most important algorithmic advances in recommendations over the last 10 years? 1. Result dithering 2. Anti-flood ©MapR Technologies 2013 12
  • 12. The Real Issues  Exploration  Diversity  Speed  Not the last fraction of a percent ©MapR Technologies 2013 13
  • 13. Result Dithering  Dithering is used to re-order recommendation results – Re-ordering is done randomly  Dithering is guaranteed to make off-line performance worse  Dithering also has a near perfect record of making actual performance much better ©MapR Technologies 2013 14
  • 14. Result Dithering  Dithering is used to re-order recommendation results – Re-ordering is done randomly  Dithering is guaranteed to make off-line performance worse  Dithering also has a near perfect record of making actual performance much better “Made more difference than any other change” ©MapR Technologies 2013 15
  • 15. Simple Dithering Algorithm  Generate synthetic score from log rank plus Gaussian s = logr + N(0, e )  Pick noise scale to provide desired level of mixing Dr µ r exp e  Typically e Î [ 0.4, 0.8]  Oh… use floor(t/T) as seed ©MapR Technologies 2013 16
  • 16. Example … ε = 0.5 [Table: sample result orderings after dithering, shown as original ranks.]
  • 17. Example … ε = log 2 = 0.69 [Table: sample result orderings after dithering, shown as original ranks.]
  • 18. Exploring The Second Page
  • 19. Lesson 1: Exploration is good
  • 20. Example 2: Bayesian Bandits
  • 21. Bayesian Bandits: Based on Thompson sampling. Very general sequential test. Near-optimal regret. Trades off exploration and exploitation. Possibly the best known solution for exploration/exploitation. Incredibly simple.
  • 22. Thompson Sampling: Select each shell according to the probability that it is the best. The probability that it is the best can be computed using the posterior: $P(i \text{ is best}) = \int I\left[ E[r_i \mid \theta] = \max_j E[r_j \mid \theta] \right] P(\theta \mid D) \, d\theta$. But I promised a simple answer.
  • 23. Thompson Sampling – Take 2: Sample θ: $\theta \sim P(\theta \mid D)$. Pick i to maximize reward: $i = \arg\max_j E[r_j \mid \theta]$. Record the result from using i.
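The three steps of "Take 2" (sample θ, pick the maximizing i, record the result) can be sketched for the simplest case. The example assumes yes/no rewards with a Beta posterior per option, which is a choice made here for illustration; the slides do not fix a reward model, and the convergence plot in the deck uses a Gamma-Normal model instead. The class and method names are hypothetical.

```python
import random


class BetaBernoulliBandit:
    """Thompson sampling for options with success/failure rewards."""

    def __init__(self, n_arms, seed=None):
        # Beta(1, 1) is a uniform prior on each option's success rate.
        self.alpha = [1] * n_arms
        self.beta = [1] * n_arms
        self.rng = random.Random(seed)

    def choose(self):
        # Sample theta ~ P(theta | D): one plausible success rate per option.
        theta = [
            self.rng.betavariate(a, b)
            for a, b in zip(self.alpha, self.beta)
        ]
        # For a Bernoulli reward E[r_j | theta] = theta_j, so take the argmax.
        return max(range(len(theta)), key=theta.__getitem__)

    def record(self, arm, success):
        # Posterior update: count the outcome for the option that was used.
        if success:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1
```

A caller alternates `choose()` and `record(arm, success)`. Options with little data have wide posteriors and so occasionally produce the largest sample, which is where the exploration comes from without any explicit exploration parameter.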
  • 24. Fast Convergence [Plot: regret versus n (0 to 1100) for ε-greedy with ε = 0.05 and for a Bayesian Bandit with Gamma-Normal.]
  • 25. Thompson Sampling on Ads: An Empirical Evaluation of Thompson Sampling, Chapelle and Li, 2011
  • 26. Bayesian Bandits versus Result Dithering: Many useful systems are difficult to frame in fully Bayesian form. Thompson sampling cannot be applied without posterior sampling. Can still do useful exploration with dithering. But better to use Thompson sampling if possible.
  • 27. Lesson 2: Exploration is pretty easy to do and pays big benefits.
  • 28. Example 3: On-line Clustering
  • 29. The Problem: K-means clustering is useful for feature extraction or compression. At scale and at high dimension, the desirable number of clusters increases. A very large number of clusters may require more passes through the data. Super-linear scaling is generally infeasible.
  • 30. The Solution: Sketch-based algorithms produce a sketch of the data. Streaming k-means uses adaptive dp-means to produce this sketch in the form of many weighted centroids which approximate the original distribution. The size of the sketch grows very slowly with increasing data size. Many operations such as clustering are well behaved on sketches. References: Fast and Accurate k-means For Large Datasets, Michael Shindler, Alex Wong, Adam Meyerson. Revisiting k-means: New Algorithms via Bayesian Nonparametrics, Brian Kulis, Michael Jordan.
  • 33. The Cluster Proximity Features: Every point can be described by the nearest cluster: 4.3 bits per point in this case, with significant error that can be decreased (to a point) by increasing the number of clusters. Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities): error is negligible, and this unwinds the data into a simple representation. Or we can increase the number of clusters (an n-fold increase adds log n bits per point and decreases error by sqrt(n)).
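The bit counts on slide 33 follow from the cost of naming one cluster out of k, which is log2(k) bits. The sketch below works that arithmetic and shows one way to compute the nearest-cluster features. The figure of 20 clusters is inferred here from 4.3 ≈ log2(20); the slide does not state the cluster count, and the function names are hypothetical.

```python
import math


def bits_per_point(n_clusters):
    """Bits needed to name one cluster out of n_clusters."""
    return math.log2(n_clusters)


def proximity_features(point, centroids, n_nearest=2):
    """Return (cluster_index, distance) pairs for the n_nearest centroids."""
    dists = [(math.dist(point, c), i) for i, c in enumerate(centroids)]
    dists.sort()
    return [(i, d) for d, i in dists[:n_nearest]]


# Assumed cluster count: 20 clusters cost about 4.3 bits per point.
print(round(bits_per_point(20), 1))
# Doubling the number of clusters adds exactly one more bit per point.
print(bits_per_point(40) - bits_per_point(20))
```

With `n_nearest=1` this gives the plain nearest-cluster code; with `n_nearest=2` it gives the two cluster ids plus the two proximities that the slide describes as having negligible error.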
  • 35. Lots of Clusters Are Fine
  • 36. Typical k-means Failure: Selecting two seeds here cannot be fixed with Lloyd's algorithm. The result is that these two clusters get glued together.
  • 37. Streaming k-means Ideas: By using a sketch with lots (k log N) of centroids, we avoid pathological cases. We still get a very good result if the sketch is created in one pass with approximate search. In fact, adaptive dp-means works just fine. In the end, the sketch can be used for clustering or …
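A one-pass sketch in the spirit of adaptive dp-means can be illustrated as follows. This is a simplified stand-in written for this transcript, not Mahout's streaming k-means: it uses exact nearest-centroid search where the slide mentions approximate search, a deterministic distance threshold, and hypothetical names and defaults (`streaming_sketch`, `threshold`, `growth`). Each point either merges into the nearest centroid or starts a new one; when the sketch grows past its budget, the threshold is loosened and the centroids themselves are re-clustered.

```python
import math


def _absorb(sketch, point, weight, threshold):
    """Merge a weighted point into the nearest centroid, or start a new one."""
    if sketch:
        i = min(range(len(sketch)), key=lambda j: math.dist(sketch[j][0], point))
        centroid, w = sketch[i]
        if math.dist(centroid, point) <= threshold:
            total = w + weight
            merged = [(c * w + p * weight) / total for c, p in zip(centroid, point)]
            sketch[i] = (merged, total)
            return
    sketch.append((list(point), weight))


def streaming_sketch(points, max_centroids, threshold=1.0, growth=1.5):
    """Single pass over points; returns a list of (centroid, weight) pairs."""
    if max_centroids < 1 or threshold <= 0 or growth <= 1:
        raise ValueError("need max_centroids >= 1, threshold > 0, growth > 1")
    sketch = []
    for p in points:
        _absorb(sketch, p, 1.0, threshold)
        while len(sketch) > max_centroids:
            # Too many centroids: loosen the threshold and re-cluster
            # the sketch itself, carrying the accumulated weights along.
            threshold *= growth
            old, sketch = sketch, []
            for c, w in old:
                _absorb(sketch, c, w, threshold)
    return sketch
```

The weights always sum to the number of points seen, so the result can be handed to an ordinary weighted k-means run in memory as a small proxy for the full data set.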
  • 38. Lesson 3: Sketches make big data small.
  • 39. Example 4: Search Abuse
  • 40. Recommendations: Alice got an apple and a puppy. Charles got a bicycle.
  • 41. Recommendations: Alice got an apple and a puppy. Bob got an apple. Charles got a bicycle.
  • 42. Recommendations: What else would Bob like?
  • 44. History Matrix: Users by Items [Table: rows Alice, Bob, Charles; a check mark for each item a user has interacted with.]
  • 45. Co-occurrence Matrix: Items by Items. How do you tell which co-occurrences are useful? [Table: item-by-item co-occurrence counts.]
  • 46. Co-occurrence Binary Matrix [Table: binary co-occurrence matrix; labels include “not”.]
  • 47. Indicator Matrix: Anomalous Co-Occurrence. Result: The marked row will be added to the indicator field in the item document…
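The path from history matrix to indicator matrix on slides 44 to 47 can be sketched on the toy data. The slides do not name the test used to flag a co-occurrence as anomalous; the log-likelihood ratio (G²) test is assumed here, and the threshold of 0.5 is a hypothetical value chosen only so that three users produce a visible result. The histories follow the slides plus the editor's note that everybody also got a pony.

```python
import math
from itertools import combinations


def _xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)


def _entropy(*counts):
    # Unnormalized entropy of a set of counts.
    return _xlogx(sum(counts)) - sum(_xlogx(c) for c in counts)


def llr(k11, k12, k21, k22):
    """G^2 score for a 2x2 table: both / A only / B only / neither."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def indicators(histories, threshold=0.5):
    """Map each item to the items that co-occur with it anomalously."""
    n_users = len(histories)
    items = sorted(set().union(*histories.values()))
    users_with = {i: {u for u, h in histories.items() if i in h} for i in items}
    result = {i: [] for i in items}
    for a, b in combinations(items, 2):
        k11 = len(users_with[a] & users_with[b])
        if k11 == 0:
            continue  # never seen together, so not a co-occurrence at all
        k12 = len(users_with[a]) - k11
        k21 = len(users_with[b]) - k11
        k22 = n_users - k11 - k12 - k21
        if llr(k11, k12, k21, k22) > threshold:
            result[a].append(b)
            result[b].append(a)
    return result


histories = {
    "Alice": {"apple", "puppy", "pony"},
    "Bob": {"apple", "pony"},
    "Charles": {"bicycle", "pony"},
}
print(indicators(histories))
```

On this data the pony scores zero against every other item, because an item that everyone has carries no information, and the only surviving indicator pair is apple with puppy.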
  • 48. Indicator Matrix: That one row from the indicator matrix becomes the indicator field in the Solr document used to deploy the recommendation engine. Example document: id: t4; title: puppy; desc: The sweetest little puppy ever.; keywords: puppy, dog, pet; indicators: (t1). Note: data for the indicator field is added directly to the meta-data for a document in the Solr index. You don’t need to create a separate index for the indicators.
  • 49. Internals of the Recommender Engine
  • 50. Internals of the Recommender Engine
  • 51. Looking Inside LucidWorks. Real-time recommendation query and results: Evaluation. What to recommend if a new user listened to 2122: Fats Domino & 303: Beatles? Recommendation is “1710: Chuck Berry”
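The query on slide 51 amounts to searching the indicator field with the user's recent history as the query terms. The sketch below is a plain in-memory stand-in for that search, not Solr query syntax or the LucidWorks API. The artist ids come from the slide, but the indicator lists, the extra document, and the hit-count scoring are hypothetical; a real search engine would weight matches by term rarity rather than counting them.

```python
def recommend(history, docs, k=3):
    """Rank documents by how many of their indicators appear in history."""
    seen = set(history)
    scored = []
    for doc in docs:
        if doc["id"] in seen:
            continue  # do not recommend what the user already has
        hits = len(seen & set(doc["indicators"]))
        if hits:
            scored.append((hits, doc["id"]))
    # Most matching indicators first; ties broken by id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


# Indicator lists below are made up for illustration.
docs = [
    {"id": 1710, "title": "Chuck Berry", "indicators": [2122, 303]},
    {"id": 303, "title": "Beatles", "indicators": [1710]},
    {"id": 2122, "title": "Fats Domino", "indicators": [1710]},
    {"id": 9999, "title": "(hypothetical unrelated artist)", "indicators": [4242]},
]
print(recommend([2122, 303], docs))
```

Since the indicators live in the same document as the title and keywords, the same index can serve ordinary text search and recommendations.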
  • 53. Lesson 4: Recursive search abuse pays. Search can implement recs, which can implement search.
  • 56. Me, Us. Ted Dunning, Chief Application Architect, MapR; committer and PMC member, Mahout, Zookeeper, Drill; bought the beer at the first HUG. MapR: distributes more open source components for Hadoop; adds major technology for performance, HA, industry standard APIs. Info: hash tag #mapr; see also @ApacheMahout @ApacheDrill @ted_dunning and @mapR

Editor's Notes

  1. A history of what everybody has done. Obviously this is just a cartoon, because large numbers of users and interactions with items would be required to build a recommender. The next step will be to predict what a new user might like…
  2. Bob is the “new user”, and getting an apple is his history.
  3. Here is where the recommendation engine needs to go to work… Note to trainer: you might see if the audience calls out the answer before revealing the next slide…
  4. Note to trainer: This situation is similar to the one in which we started, with three users in our history. The difference is that now everybody got a pony. Bob has an apple and a pony but not a puppy… yet.
  5. The binary matrix is stored sparsely.
  6. Convert by MapReduce into a binary matrix. Note to trainer: Whether to consider apple to have co-occurred with itself is an open question.
  7. Old joke: all the world can be divided into 2 categories: Scotch tape and non-Scotch tape… This is a way to think about the co-occurrence.
  8. The only important co-occurrence is that puppy follows apple.
  9. Take that row of the matrix and combine it with all the meta-data we might have… The important thing to get from the co-occurrence matrix is this indicator. Cool thing: this is analogous to what a lot of recommendation engines do. This row forms the indicator field in a Solr document containing meta-data (you do NOT have to build a separate index for the indicators). Find the useful co-occurrence and get rid of the rest. Sparsify and get the anomalous co-occurrence.
  10. Note to trainer: take a little time to explore this here and on the next couple of slides. Details are enlarged on the next slide.
  11. This indicator field is where the output of the Mahout recommendation engine is stored (the row from the indicator matrix that identified significant or interesting co-occurrence). Keep in mind that this recommendation indicator data is added to the same original document in the Solr index that contains meta-data for the item in question.
  12. This is a diagnostics window in the LucidWorks Solr index (not the web interface a user would see). It’s a way for the developer to do a rough evaluation (laugh test) of the choices offered by the recommendation engine. In other words, do these indicator artists, represented by their indicator IDs, make reasonable recommendations? Note to trainer: artist 303 happens to be The Beatles. Is that a good match for Chuck Berry?