SlideShare a Scribd company logo
© 2014 MapR Technologies 1© 2014 MapR Technologies
© 2014 MapR Technologies 2
Who I am
Ted Dunning, Chief Applications Architect, MapR Technologies
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning
Apache Mahout https://mahout.apache.org/
Twitter @ApacheMahout
© 2014 MapR Technologies 3
Agenda
• Background – recommending with puppies and ponies
• Speed tricks
• Accuracy tricks
• Moving to real-time
© 2014 MapR Technologies 4
Puppies and Ponies
© 2014 MapR Technologies 5
Cooccurrence AnalysisCooccurrence Analysis
© 2014 MapR Technologies 6
How Often Do Items Co-occur
How often do items co-occur?
© 2014 MapR Technologies 7
Which Co-occurrences are Interesting?
Which cooccurences are interesting?
Each row of indicators becomes a field in a
search engine document
© 2014 MapR Technologies 8
Recommendations
Alice got an apple and
a puppyAlice
© 2014 MapR Technologies 9
Recommendations
Alice got an apple and
a puppyAlice
Charles got a bicycleCharles
© 2014 MapR Technologies 10
Recommendations
Alice got an apple and
a puppyAlice
Charles got a bicycleCharles
Bob Bob got an apple
© 2014 MapR Technologies 11
Recommendations
Alice got an apple and
a puppyAlice
Charles got a bicycleCharles
Bob What else would Bob like?
© 2014 MapR Technologies 12
Recommendations
Alice got an apple and
a puppyAlice
Charles got a bicycleCharles
Bob A puppy!
© 2014 MapR Technologies 13
By the way, like me, Bob also
wants a pony…
© 2014 MapR Technologies 14
Recommendations
?
Alice
Bob
Charles
Amelia
What if everybody gets a
pony?
What else would you recommend for
new user Amelia?
© 2014 MapR Technologies 15
Recommendations
?
Alice
Bob
Charles
Amelia
If everybody gets a pony, it’s
not a very good indicator of
what to else predict...
© 2014 MapR Technologies 16
Problems with Raw Co-occurrence
• Very popular items co-occur with everything or why it’s not very
helpful to know that everybody wants a pony…
– Examples: Welcome document; Elevator music
• Very widespread occurrence is not interesting to generate indicators
for recommendation
– Unless you want to offer an item that is constantly desired, such as
razor blades (or ponies)
• What we want is anomalous co-occurrence
– This is the source of interesting indicators of preference on which to
base recommendation
© 2014 MapR Technologies 17
Overview: Get Useful Indicators from Behaviors
1. Use log files to build history matrix of users x items
– Remember: this history of interactions will be sparse compared to all
potential combinations
2. Transform to a co-occurrence matrix of items x items
3. Look for useful indicators by identifying anomalous co-occurrences to
make an indicator matrix
– Log Likelihood Ratio (LLR) can be helpful to judge which co-
occurrences can with confidence be used as indicators of preference
– ItemSimilarityJob in Apache Mahout uses LLR
© 2014 MapR Technologies 18
Which one is the anomalous co-occurrence?
A not A
B 13 1000
not B 1000 100,000
A not A
B 1 0
not B 0 10,000
A not A
B 10 0
not B 0 100,000
A not A
B 1 0
not B 0 2
© 2014 MapR Technologies 19
Which one is the anomalous co-occurrence?
A not A
B 13 1000
not B 1000 100,000
A not A
B 1 0
not B 0 10,000
A not A
B 10 0
not B 0 100,000
A not A
B 1 0
not B 0 2
0.90 1.95
4.52 14.3
Dunning Ted, Accurate Methods for the Statistics of Surprise and Coincidence,
Computational Linguistics vol 19 no. 1 (1993)
© 2014 MapR Technologies 20
Collection of Documents: Insert Meta-Data
Search Engine
Item
meta-data
Document for
“puppy”
id: t4
title: puppy
desc: The sweetest little puppy
ever.
keywords: puppy, dog, pet
Ingest easily via NFS
✔ indicators: (t1)
© 2014 MapR Technologies 22
Cooccurrence Mechanics
• Cooccurrence is just a self-join
for each user, i
for each history item j1 in Ai*
for each history item j2 in Ai*
count pair (j1, j2)
© 2014 MapR Technologies 23
Cross-occurrence Mechanics
• Cross occurrence is just a self-join of adjoined matrices
for each user, i
for each history item j1 in Ai*
for each history item j2 in Bi*
count pair (j1, j2)
© 2014 MapR Technologies 24
A word about scaling
© 2014 MapR Technologies 25
A few pragmatic tricks
• Downsample all user histories to max length (interaction cut)
– Can be random or most-recent (no apparent effect on accuracy)
– Prolific users are often pathological anyway
– Common limit is 300 items (no apparent effect on accuracy)
• Downsample all items to limit max viewers (frequency limit)
– Can be random or earliest (no apparent effect)
– Ubiquitous items are uninformative
– Common limit is 500 users (no apparent effect)
Schelter, et al. Scalable similarity-based neighborhood methods with MapReduce.
Proceedings of the sixth ACM conference on Recommender systems. 2012
© 2014 MapR Technologies 26
But note!
• Number of pairs for a user history with ki distinct items is ≈ ki
2/2
• Average size of user history increases with increasing dataset
– Average may grow more slowly than N (or not!)
– Full cooccurrence cost grows strictly faster than N
– i.e. it just doesn’t scale
• Downsampling interactions places bounds on per user cost
– Cooccurrence with interaction cut is scalable
© 2014 MapR Technologies 27
0 200 400 600 800 1000
0123456
Benefit of down−sampling
User Limit
Pairs(x109
)
Without down−sampling
Track limit = 1000
500
200
Computed on 48,373,586 pair−wise triples
from the million song dataset
●
●
© 2014 MapR Technologies 28
Batch Scaling in Time Implies Scaling in Space
• Note:
– With frequency limit sampling, max cooccurrence count is small (<1000)
– With interaction cut, total number of non-zero pairs is relatively small
– Entire cooccurrence matrix can be stored in memory in ~10-15 GB
• Specifically:
– With interaction cut, cooccurrence scales in size
– Without interaction cut, cooccurrence does not scale size-wise
© 2014 MapR Technologies 29
Impact of Interaction Cut Downsampling
• Interaction cut allows batch cooccurrence analysis to be O(N) in
time and space
• This is intriguing
– Amortized cost is low
– Could this be extended to an on-line form?
• Incremental matrix factorization is hard
– Could cooccurrence be a key alternative?
• Scaling matters most at scale
– Cooccurrence is very accurate at large scale
– Factorization shows benefits at smaller scales
© 2014 MapR Technologies 30
Online update
© 2014 MapR Technologies 31
Requirements for Online Algorithms
• Each unit of input must require O(1) work
– Theoretical bound
• The constants have to be small enough on average
– Pragmatic constraint
• Total accumulated data must be small (enough)
– Pragmatic constraint
© 2014 MapR Technologies 32
Log Files
Search
Technology
Item
Meta-Data
via
NFS
MapR Cluster
via
NFS PostPre
Recommendations
New User
History
Web
Tier
Recommendations
happen in real-time
Batch co-
occurrence
Want this to be real-time
Real-time recommendations using MapR data platform
© 2014 MapR Technologies 33
Space Bound Implies Time Bound
• Because user histories are pruned, only a limited number of
value updates need be made with each new observation
• This bound is just twice the interaction cut kmax
– Which is a constant
• Bounding the number of updates trivially bounds the time
© 2014 MapR Technologies 34
Implications for Online Update
© 2014 MapR Technologies 35
With interaction cut at
© 2014 MapR Technologies 36
But Wait, It Gets Better
• The new observation may be pruned
– For users at the interaction cut, we can ignore updates
– For items at the frequency cut, we can ignore updates
– Ignoring updates only affects indicators, not recommendation query
– At million song dataset size, half of all updates are pruned
• On average ki is much less than the interaction cut
– For million song dataset, average appears to grow with log of frequency
limit, with little dependency on values of interaction cut > 200
• LLR cutoff avoids almost all updates to index
• Average grows slowly with frequency cut
© 2014 MapR Technologies 37
0 200 400 600 800 1000
05101520253035
Interaction cut (kmax)
kave
Frequency cut = 1000
= 500
= 200
© 2014 MapR Technologies 38
0 200 400 600 800 1000
05101520253035
Frequency cut
kave
© 2014 MapR Technologies 39
Recap
• Cooccurrence-based recommendations are simple
– Deploying with a search engine is even better
• Interaction cut and frequency cut are key to batch scalability
• Similar effect occurs in online form of updates
– Only dozens of updates per transaction needed
– Data structure required is relatively small
– Very, very few updates cause search engine updates
• Fully online recommendation is very feasible, almost easy
© 2014 MapR Technologies 40
More Details Available
available for free at
available for free at
http://www.mapr.com/practical-machine-learning
© 2014 MapR Technologies 41
Who I am
Ted Dunning, Chief Applications Architect, MapR Technologies
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning
Apache Mahout https://mahout.apache.org/
Twitter @ApacheMahout
Apache Drill http://incubator.apache.org/drill/
Twitter @ApacheDrill
© 2014 MapR Technologies 42
Q&A
@mapr maprtech
tdunning@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies

More Related Content

What's hot

My talk about recommendation and search to the Hive
My talk about recommendation and search to the HiveMy talk about recommendation and search to the Hive
My talk about recommendation and search to the Hive
Ted Dunning
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
Ted Dunning
 
T digest-update
T digest-updateT digest-update
T digest-update
Ted Dunning
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly Detection
Ted Dunning
 
Building multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search enginesBuilding multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search engines
Ted Dunning
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation TechnTed Dunning
 
Which Algorithms Really Matter
Which Algorithms Really MatterWhich Algorithms Really Matter
Which Algorithms Really Matter
Ted Dunning
 
Cognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesCognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approaches
Ted Dunning
 
Using Mahout and a Search Engine for Recommendation
Using Mahout and a Search Engine for RecommendationUsing Mahout and a Search Engine for Recommendation
Using Mahout and a Search Engine for Recommendation
Ted Dunning
 
What's new in Apache Mahout
What's new in Apache MahoutWhat's new in Apache Mahout
What's new in Apache Mahout
Ted Dunning
 
Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0
Ted Dunning
 
Polyvalent recommendations
Polyvalent recommendationsPolyvalent recommendations
Polyvalent recommendations
Ted Dunning
 
Buzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningBuzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learning
Ted Dunning
 
How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterDataWorks Summit
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
Ted Dunning
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
Ted Dunning
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to New
MapR Technologies
 
How to tell which algorithms really matter
How to tell which algorithms really matterHow to tell which algorithms really matter
How to tell which algorithms really matterDataWorks Summit
 
Mahout and Recommendations
Mahout and RecommendationsMahout and Recommendations
Mahout and Recommendations
Ted Dunning
 

What's hot (20)

My talk about recommendation and search to the Hive
My talk about recommendation and search to the HiveMy talk about recommendation and search to the Hive
My talk about recommendation and search to the Hive
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
 
T digest-update
T digest-updateT digest-update
T digest-update
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly Detection
 
Building multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search enginesBuilding multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search engines
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation Techn
 
Which Algorithms Really Matter
Which Algorithms Really MatterWhich Algorithms Really Matter
Which Algorithms Really Matter
 
Cognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesCognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approaches
 
Using Mahout and a Search Engine for Recommendation
Using Mahout and a Search Engine for RecommendationUsing Mahout and a Search Engine for Recommendation
Using Mahout and a Search Engine for Recommendation
 
Dunning ml-conf-2014
Dunning ml-conf-2014Dunning ml-conf-2014
Dunning ml-conf-2014
 
What's new in Apache Mahout
What's new in Apache MahoutWhat's new in Apache Mahout
What's new in Apache Mahout
 
Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0
 
Polyvalent recommendations
Polyvalent recommendationsPolyvalent recommendations
Polyvalent recommendations
 
Buzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningBuzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learning
 
How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really Matter
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to New
 
How to tell which algorithms really matter
How to tell which algorithms really matterHow to tell which algorithms really matter
How to tell which algorithms really matter
 
Mahout and Recommendations
Mahout and RecommendationsMahout and Recommendations
Mahout and Recommendations
 

Viewers also liked

A review of machine learning based anomaly detection
A review of machine learning based anomaly detectionA review of machine learning based anomaly detection
A review of machine learning based anomaly detection
Mohamed Elfadly
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Cloudera, Inc.
 
Cloudera/Stanford EE203 (Entrepreneurial Engineer)
Cloudera/Stanford EE203 (Entrepreneurial Engineer)Cloudera/Stanford EE203 (Entrepreneurial Engineer)
Cloudera/Stanford EE203 (Entrepreneurial Engineer)
Amr Awadallah
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...Amr Awadallah
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch
ElasticES-Hadoop: Bridging the world of Hadoop and ElasticsearchElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch
ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch
MapR Technologies
 
MapR-DB Elasticsearch Integration
MapR-DB Elasticsearch IntegrationMapR-DB Elasticsearch Integration
MapR-DB Elasticsearch IntegrationMapR Technologies
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Cloudera, Inc.
 
Big Data Modeling and Analytic Patterns – Beyond Schema on Read
Big Data Modeling and Analytic Patterns – Beyond Schema on ReadBig Data Modeling and Analytic Patterns – Beyond Schema on Read
Big Data Modeling and Analytic Patterns – Beyond Schema on Read
Think Big, a Teradata Company
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
Cloudera, Inc.
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopChristopher Pezza
 
Baptist Health: Solving Healthcare Problems with Big Data
Baptist Health: Solving Healthcare Problems with Big DataBaptist Health: Solving Healthcare Problems with Big Data
Baptist Health: Solving Healthcare Problems with Big Data
MapR Technologies
 
Schema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-Write
Amr Awadallah
 
Big Data Paris
Big Data ParisBig Data Paris
Big Data Paris
MapR Technologies
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
mcsrivas
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data Hub
Cloudera, Inc.
 
Big Data Hadoop Briefing Hosted by Cisco, WWT and MapR: MapR Overview Present...
Big Data Hadoop Briefing Hosted by Cisco, WWT and MapR: MapR Overview Present...Big Data Hadoop Briefing Hosted by Cisco, WWT and MapR: MapR Overview Present...
Big Data Hadoop Briefing Hosted by Cisco, WWT and MapR: MapR Overview Present...
ervogler
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
hadooparchbook
 
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and Security
MapR Technologies
 
Practical Machine Learning: Innovations in Recommendation Workshop
Practical Machine Learning:  Innovations in Recommendation WorkshopPractical Machine Learning:  Innovations in Recommendation Workshop
Practical Machine Learning: Innovations in Recommendation Workshop
MapR Technologies
 

Viewers also liked (20)

A review of machine learning based anomaly detection
A review of machine learning based anomaly detectionA review of machine learning based anomaly detection
A review of machine learning based anomaly detection
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
 
Cloudera/Stanford EE203 (Entrepreneurial Engineer)
Cloudera/Stanford EE203 (Entrepreneurial Engineer)Cloudera/Stanford EE203 (Entrepreneurial Engineer)
Cloudera/Stanford EE203 (Entrepreneurial Engineer)
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch
ElasticES-Hadoop: Bridging the world of Hadoop and ElasticsearchElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch
ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch
 
MapR-DB Elasticsearch Integration
MapR-DB Elasticsearch IntegrationMapR-DB Elasticsearch Integration
MapR-DB Elasticsearch Integration
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
 
Big Data Modeling and Analytic Patterns – Beyond Schema on Read
Big Data Modeling and Analytic Patterns – Beyond Schema on ReadBig Data Modeling and Analytic Patterns – Beyond Schema on Read
Big Data Modeling and Analytic Patterns – Beyond Schema on Read
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Baptist Health: Solving Healthcare Problems with Big Data
Baptist Health: Solving Healthcare Problems with Big DataBaptist Health: Solving Healthcare Problems with Big Data
Baptist Health: Solving Healthcare Problems with Big Data
 
Schema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-Write
 
Big Data Paris
Big Data ParisBig Data Paris
Big Data Paris
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data Hub
 
Big Data Hadoop Briefing Hosted by Cisco, WWT and MapR: MapR Overview Present...
Big Data Hadoop Briefing Hosted by Cisco, WWT and MapR: MapR Overview Present...Big Data Hadoop Briefing Hosted by Cisco, WWT and MapR: MapR Overview Present...
Big Data Hadoop Briefing Hosted by Cisco, WWT and MapR: MapR Overview Present...
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
 
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and Security
 
Practical Machine Learning: Innovations in Recommendation Workshop
Practical Machine Learning:  Innovations in Recommendation WorkshopPractical Machine Learning:  Innovations in Recommendation Workshop
Practical Machine Learning: Innovations in Recommendation Workshop
 

Similar to Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time

Ted Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SFTed Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SF
MLconf
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with HadoopDataWorks Summit
 
Practical Computing with Chaos
Practical Computing with ChaosPractical Computing with Chaos
Practical Computing with Chaos
MapR Technologies
 
Practical Computing With Chaos
Practical Computing With ChaosPractical Computing With Chaos
Practical Computing With Chaos
DataWorks Summit
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
Ted Dunning
 
[Keynote] predictive technologies and the prediction of technology - Bob Will...
[Keynote] predictive technologies and the prediction of technology - Bob Will...[Keynote] predictive technologies and the prediction of technology - Bob Will...
[Keynote] predictive technologies and the prediction of technology - Bob Will...
PAPIs.io
 
Deep Learning vs. Cheap Learning
Deep Learning vs. Cheap LearningDeep Learning vs. Cheap Learning
Deep Learning vs. Cheap Learning
MapR Technologies
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
Allen Day, PhD
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
Ted Dunning
 
Dealing with an Upside Down Internet With High Performance Time Series Database
Dealing with an Upside Down Internet  With High Performance Time Series DatabaseDealing with an Upside Down Internet  With High Performance Time Series Database
Dealing with an Upside Down Internet With High Performance Time Series Database
DataWorks Summit
 
Network-Wide Heavy-Hitter Detection with Commodity Switches
Network-Wide Heavy-Hitter Detection with Commodity SwitchesNetwork-Wide Heavy-Hitter Detection with Commodity Switches
Network-Wide Heavy-Hitter Detection with Commodity Switches
AJAY KHARAT
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedTed Dunning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
Ted Dunning
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
MapR Technologies
 
Dealing with an Upside Down Internet
Dealing with an Upside Down InternetDealing with an Upside Down Internet
Dealing with an Upside Down Internet
MapR Technologies
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside Down
DataWorks Summit
 
HUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_DunningHUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_Dunning
John Mulhall
 
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
NoSQLmatters
 
Hadoop and R Go to the Movies
Hadoop and R Go to the MoviesHadoop and R Go to the Movies
Hadoop and R Go to the MoviesDataWorks Summit
 

Similar to Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time (20)

Ted Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SFTed Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SF
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with Hadoop
 
Practical Computing with Chaos
Practical Computing with ChaosPractical Computing with Chaos
Practical Computing with Chaos
 
Practical Computing With Chaos
Practical Computing With ChaosPractical Computing With Chaos
Practical Computing With Chaos
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
 
[Keynote] predictive technologies and the prediction of technology - Bob Will...
[Keynote] predictive technologies and the prediction of technology - Bob Will...[Keynote] predictive technologies and the prediction of technology - Bob Will...
[Keynote] predictive technologies and the prediction of technology - Bob Will...
 
Deep Learning vs. Cheap Learning
Deep Learning vs. Cheap LearningDeep Learning vs. Cheap Learning
Deep Learning vs. Cheap Learning
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Dealing with an Upside Down Internet With High Performance Time Series Database
Dealing with an Upside Down Internet  With High Performance Time Series DatabaseDealing with an Upside Down Internet  With High Performance Time Series Database
Dealing with an Upside Down Internet With High Performance Time Series Database
 
Network-Wide Heavy-Hitter Detection with Commodity Switches
Network-Wide Heavy-Hitter Detection with Commodity SwitchesNetwork-Wide Heavy-Hitter Detection with Commodity Switches
Network-Wide Heavy-Hitter Detection with Commodity Switches
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinned
 
GoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 SkinnedGoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 Skinned
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Dealing with an Upside Down Internet
Dealing with an Upside Down InternetDealing with an Upside Down Internet
Dealing with an Upside Down Internet
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside Down
 
HUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_DunningHUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_Dunning
 
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
 
Hadoop and R Go to the Movies
Hadoop and R Go to the MoviesHadoop and R Go to the Movies
Hadoop and R Go to the Movies
 

More from Ted Dunning

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
Ted Dunning
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
Ted Dunning
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
Ted Dunning
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
Ted Dunning
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
Ted Dunning
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
Ted Dunning
 

More from Ted Dunning (6)

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 

Recently uploaded

Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
Aftab Hussain
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
Łukasz Chruściel
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
Alina Yurenko
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
Hornet Dynamics
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Aftab Hussain
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
Ayan Halder
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
timtebeek1
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
Donna Lenk
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Łukasz Chruściel
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
AI Genie Review: World’s First Open AI WordPress Website Creator
AI Genie Review: World’s First Open AI WordPress Website CreatorAI Genie Review: World’s First Open AI WordPress Website Creator
AI Genie Review: World’s First Open AI WordPress Website Creator
Google
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
Transform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR SolutionsTransform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR Solutions
TheSMSPoint
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Mind IT Systems
 

Recently uploaded (20)

Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
AI Genie Review: World’s First Open AI WordPress Website Creator
AI Genie Review: World’s First Open AI WordPress Website CreatorAI Genie Review: World’s First Open AI WordPress Website Creator
AI Genie Review: World’s First Open AI WordPress Website Creator
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
Transform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR SolutionsTransform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR Solutions
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
 

Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time

  • 1. © 2014 MapR Technologies 1© 2014 MapR Technologies
  • 2. © 2014 MapR Technologies 2 Who I am Ted Dunning, Chief Applications Architect, MapR Technologies Email tdunning@mapr.com tdunning@apache.org Twitter @Ted_Dunning Apache Mahout https://mahout.apache.org/ Twitter @ApacheMahout
  • 3. © 2014 MapR Technologies 3 Agenda • Background – recommending with puppies and ponies • Speed tricks • Accuracy tricks • Moving to real-time
  • 4. © 2014 MapR Technologies 4 Puppies and Ponies
  • 5. © 2014 MapR Technologies 5 Cooccurrence AnalysisCooccurrence Analysis
  • 6. © 2014 MapR Technologies 6 How Often Do Items Co-occur How often do items co-occur?
  • 7. © 2014 MapR Technologies 7 Which Co-occurrences are Interesting? Which cooccurences are interesting? Each row of indicators becomes a field in a search engine document
  • 8. © 2014 MapR Technologies 8 Recommendations Alice got an apple and a puppyAlice
  • 9. © 2014 MapR Technologies 9 Recommendations Alice got an apple and a puppyAlice Charles got a bicycleCharles
  • 10. © 2014 MapR Technologies 10 Recommendations Alice got an apple and a puppyAlice Charles got a bicycleCharles Bob Bob got an apple
  • 11. © 2014 MapR Technologies 11 Recommendations Alice got an apple and a puppyAlice Charles got a bicycleCharles Bob What else would Bob like?
  • 12. © 2014 MapR Technologies 12 Recommendations Alice got an apple and a puppyAlice Charles got a bicycleCharles Bob A puppy!
  • 13. © 2014 MapR Technologies 13 By the way, like me, Bob also wants a pony…
  • 14. © 2014 MapR Technologies 14 Recommendations ? Alice Bob Charles Amelia What if everybody gets a pony? What else would you recommend for new user Amelia?
  • 15. © 2014 MapR Technologies 15 Recommendations ? Alice Bob Charles Amelia If everybody gets a pony, it’s not a very good indicator of what to else predict...
  • 16. © 2014 MapR Technologies 16 Problems with Raw Co-occurrence • Very popular items co-occur with everything or why it’s not very helpful to know that everybody wants a pony… – Examples: Welcome document; Elevator music • Very widespread occurrence is not interesting to generate indicators for recommendation – Unless you want to offer an item that is constantly desired, such as razor blades (or ponies) • What we want is anomalous co-occurrence – This is the source of interesting indicators of preference on which to base recommendation
  • 17. © 2014 MapR Technologies 17 Overview: Get Useful Indicators from Behaviors 1. Use log files to build history matrix of users x items – Remember: this history of interactions will be sparse compared to all potential combinations 2. Transform to a co-occurrence matrix of items x items 3. Look for useful indicators by identifying anomalous co-occurrences to make an indicator matrix – Log Likelihood Ratio (LLR) can be helpful to judge which co- occurrences can with confidence be used as indicators of preference – ItemSimilarityJob in Apache Mahout uses LLR
  • 18. © 2014 MapR Technologies 18 Which one is the anomalous co-occurrence? A not A B 13 1000 not B 1000 100,000 A not A B 1 0 not B 0 10,000 A not A B 10 0 not B 0 100,000 A not A B 1 0 not B 0 2
  • 19. © 2014 MapR Technologies 19 Which one is the anomalous co-occurrence? A not A B 13 1000 not B 1000 100,000 A not A B 1 0 not B 0 10,000 A not A B 10 0 not B 0 100,000 A not A B 1 0 not B 0 2 0.90 1.95 4.52 14.3 Dunning Ted, Accurate Methods for the Statistics of Surprise and Coincidence, Computational Linguistics vol 19 no. 1 (1993)
  • 20. © 2014 MapR Technologies 20 Collection of Documents: Insert Meta-Data Search Engine Item meta-data Document for “puppy” id: t4 title: puppy desc: The sweetest little puppy ever. keywords: puppy, dog, pet Ingest easily via NFS ✔ indicators: (t1)
  • 21. © 2014 MapR Technologies 22 Cooccurrence Mechanics • Cooccurrence is just a self-join for each user, i for each history item j1 in Ai* for each history item j2 in Ai* count pair (j1, j2)
  • 22. © 2014 MapR Technologies 23 Cross-occurrence Mechanics • Cross occurrence is just a self-join of adjoined matrices for each user, i for each history item j1 in Ai* for each history item j2 in Bi* count pair (j1, j2)
  • 23. © 2014 MapR Technologies 24 A word about scaling
  • 24. © 2014 MapR Technologies 25 A few pragmatic tricks • Downsample all user histories to max length (interaction cut) – Can be random or most-recent (no apparent effect on accuracy) – Prolific users are often pathological anyway – Common limit is 300 items (no apparent effect on accuracy) • Downsample all items to limit max viewers (frequency limit) – Can be random or earliest (no apparent effect) – Ubiquitous items are uninformative – Common limit is 500 users (no apparent effect) Schelter, et al. Scalable similarity-based neighborhood methods with MapReduce. Proceedings of the sixth ACM conference on Recommender systems. 2012
  • 25. © 2014 MapR Technologies 26 But note! • Number of pairs for a user history with ki distinct items is ≈ ki 2/2 • Average size of user history increases with increasing dataset – Average may grow more slowly than N (or not!) – Full cooccurrence cost grows strictly faster than N – i.e. it just doesn’t scale • Downsampling interactions places bounds on per user cost – Cooccurrence with interaction cut is scalable
  • 26. © 2014 MapR Technologies 27 0 200 400 600 800 1000 0123456 Benefit of down−sampling User Limit Pairs(x109 ) Without down−sampling Track limit = 1000 500 200 Computed on 48,373,586 pair−wise triples from the million song dataset ● ●
  • 27. © 2014 MapR Technologies 28 Batch Scaling in Time Implies Scaling in Space • Note: – With frequency limit sampling, max cooccurrence count is small (<1000) – With interaction cut, total number of non-zero pairs is relatively small – Entire cooccurrence matrix can be stored in memory in ~10-15 GB • Specifically: – With interaction cut, cooccurrence scales in size – Without interaction cut, cooccurrence does not scale size-wise
  • 28. © 2014 MapR Technologies 29 Impact of Interaction Cut Downsampling • Interaction cut allows batch cooccurrence analysis to be O(N) in time and space • This is intriguing – Amortized cost is low – Could this be extended to an on-line form? • Incremental matrix factorization is hard – Could cooccurrence be a key alternative? • Scaling matters most at scale – Cooccurrence is very accurate at large scale – Factorization shows benefits at smaller scales
  • 29. © 2014 MapR Technologies 30 Online update
  • 30. © 2014 MapR Technologies 31 Requirements for Online Algorithms • Each unit of input must require O(1) work – Theoretical bound • The constants have to be small enough on average – Pragmatic constraint • Total accumulated data must be small (enough) – Pragmatic constraint
  • 31. © 2014 MapR Technologies 32 Log Files Search Technology Item Meta-Data via NFS MapR Cluster via NFS PostPre Recommendations New User History Web Tier Recommendations happen in real-time Batch co- occurrence Want this to be real-time Real-time recommendations using MapR data platform
  • 32. © 2014 MapR Technologies 33 Space Bound Implies Time Bound • Because user histories are pruned, only a limited number of value updates need be made with each new observation • This bound is just twice the interaction cut kmax – Which is a constant • Bounding the number of updates trivially bounds the time
  • 33. © 2014 MapR Technologies 34 Implications for Online Update
  • 34. © 2014 MapR Technologies 35 With interaction cut at
  • 35. © 2014 MapR Technologies 36 But Wait, It Gets Better • The new observation may be pruned – For users at the interaction cut, we can ignore updates – For items at the frequency cut, we can ignore updates – Ignoring updates only affects indicators, not recommendation query – At million song dataset size, half of all updates are pruned • On average ki is much less than the interaction cut – For million song dataset, average appears to grow with log of frequency limit, with little dependency on values of interaction cut > 200 • LLR cutoff avoids almost all updates to index • Average grows slowly with frequency cut
  • 36. © 2014 MapR Technologies 37 0 200 400 600 800 1000 05101520253035 Interaction cut (kmax) kave Frequency cut = 1000 = 500 = 200
  • 37. © 2014 MapR Technologies 38 0 200 400 600 800 1000 05101520253035 Frequency cut kave
  • 38. © 2014 MapR Technologies 39 Recap • Cooccurrence-based recommendations are simple – Deploying with a search engine is even better • Interaction cut and frequency cut are key to batch scalability • Similar effect occurs in online form of updates – Only dozens of updates per transaction needed – Data structure required is relatively small – Very, very few updates cause search engine updates • Fully online recommendation is very feasible, almost easy
  • 39. © 2014 MapR Technologies 40 More Details Available available for free at available for free at http://www.mapr.com/practical-machine-learning
  • 40. © 2014 MapR Technologies 41 Who I am Ted Dunning, Chief Applications Architect, MapR Technologies Email tdunning@mapr.com tdunning@apache.org Twitter @Ted_Dunning Apache Mahout https://mahout.apache.org/ Twitter @ApacheMahout Apache Drill http://incubator.apache.org/drill/ Twitter @ApacheDrill
  • 41. © 2014 MapR Technologies 42 Q&A @mapr maprtech tdunning@mapr.com Engage with us! MapR maprtech mapr-technologies

Editor's Notes

  1. Mention that the Pony book said “RowSimilarityJob”…