© 2014 MapR Techno©lo 2g0ie1s4 MapR Technologies 1
© 2014 MapR Techno©lo 2g0ie1s4 MapR Technologies 2
© 2014 MapR Technologies 3 
Who I am 
Ted Dunning, Chief Applications Architect, MapR Technologies 
Email tdunning@mapr.com tdunning@apache.org 
Twitter @Ted_Dunning 
Apache Mahout https://mahout.apache.org/ 
Twitter @ApacheMahout
© 2014 MapR Technologies 4 
The outline 
• The first open source, big data project 
• Another big data project 
• Conclusions
© 2014 MapR Technologies 5 
First: 
An apology for going 
off-script
© 2014 MapR Technologies 6 
Now, the story
© 2014 MapR Technologies 7
In 1866, the top finishers 
in the tea race reached 
London in 99 days, within 
2 hours of each other 
© 2014 MapR Technologies 8
© 2014 MapR Technologies 9
But in 1851, the record 
had been set at 89 days 
by the Flying Cloud 
© 2014 MapR Technologies 10
© 2014 MapR Technologies 11 
The difference was due 
(in part) 
to big data
© 2014 MapR Technologies 12
© 2014 MapR Technologies 13
© 2014 MapR Technologies 14
These charts were free … 
If you donated your data 
© 2014 MapR Technologies 15
But how does this apply 
today? 
© 2014 MapR Technologies 16
© 2014 MapR Technologies 17 
Key Points of Maury’s Work 
• Give to get 
– Give the Abstract Log to captains, get data 
• Data consortium wins 
– Merging data gives pictures nobody else can see 
• Give back 
– Them that gives, also gets 
• But this is just what every data driven web site does! 
– Just 150 years before everybody else
© 2014 MapR Technologies 18 
The Real News in Behavioral Analysis 
• Everybody knows that: 
• You need ensembles of many models to do recommendations 
• You need to use factorization models 
• You predict what you observe 
• (You should predict ratings)
© 2014 MapR Technologies 19 
But … 
none of this is really true
© 2014 MapR Technologies 20 
In fact, 
• Fancy models are rarely useful expenditures of time 
• Factorization can be good, but not much better (if at all) 
• Ratings are disastrously bad data 
• Cross-recommendation and multi-modal recommendations are 
much more interesting 
– Multiple kinds of input are far better than multiple models 
• The UI has a far larger impact than the models 
• The best algorithms combine simplicity with accuracy 
– So simple you can embed them in a search engine
© 2014 MapR Technologies 21 
Here’s how
© 2014 MapR Technologies 22 
CooccurrenCcoeoAcncaulryrseisn ce Analysis
© 2014 MapR Technologies 23 
How Often Do Items Co-occur 
How often do items co-occur?
© 2014 MapR Technologies 24 
Which Co-occurrences are Interesting? 
Which cooccurences are interesting? 
Each row of indicators becomes a field in a 
search engine document
© 2014 MapR Technologies 25 
Recommendations 
Alice got an apple and 
Alice a puppy
© 2014 MapR Technologies 26 
Recommendations 
Alice got an apple and 
Alice a puppy 
Charles Charles got a bicycle
© 2014 MapR Technologies 27 
Recommendations 
Alice got an apple and 
Alice a puppy 
Bob Bob got an apple 
Charles Charles got a bicycle
© 2014 MapR Technologies 28 
Recommendations 
Alice got an apple and 
Alice a puppy 
Bob What else would Bob like? 
Charles Charles got a bicycle
© 2014 MapR Technologies 29 
Recommendations 
Alice got an apple and 
Alice a puppy 
Bob A puppy! 
Charles Charles got a bicycle
© 2014 MapR Technologies 30 
You get the idea of how 
recommenders can work…
By the way, like me, Bob also 
wants a pony… 
© 2014 MapR Technologies 31
© 2014 MapR Technologies 32 
Recommendations 
? 
Alice 
Bob 
Amelia 
Charles 
What if everybody gets a 
pony? 
What else would you recommend for 
new user Amelia?
© 2014 MapR Technologies 33 
Recommendations 
? 
Alice 
Bob 
Amelia 
Charles 
If everybody gets a pony, it’s 
not a very good indicator of 
what to else predict...
© 2014 MapR Technologies 34 
Problems with Raw Co-occurrence 
• Very popular items co-occur with everything or why it’s not very 
helpful to know that everybody wants a pony… 
– Examples: Welcome document; Elevator music 
• Very widespread occurrence is not interesting to generate indicators 
for recommendation 
– Unless you want to offer an item that is constantly desired, such as 
razor blades (or ponies) 
• What we want is anomalous co-occurrence 
– This is the source of interesting indicators of preference on which to 
base recommendation
Overview: Get Useful Indicators from Behaviors 
© 2014 MapR Technologies 35 
1. Use log files to build history matrix of users x items 
– Remember: this history of interactions will be sparse compared to all 
potential combinations 
2. Transform to a co-occurrence matrix of items x items 
3. Look for useful indicators by identifying anomalous co-occurrences to 
make an indicator matrix 
– Log Likelihood Ratio (LLR) can be helpful to judge which co-occurrences 
can with confidence be used as indicators of preference 
– ItemSimilarityJob in Apache Mahout uses LLR
Which one is the anomalous co-occurrence? 
A not A 
B 1 0 
not B 0 2 
© 2014 MapR Technologies 36 
A not A 
B 13 1000 
not B 1000 100,000 
A not A 
B 1 0 
not B 0 10,000 
A not A 
B 10 0 
not B 0 100,000
Which one is the anomalous co-occurrence? 
A not A 
0.90 1.95 
B 1 0 
not B 0 2 
© 2014 MapR Technologies 37 
A not A 
B 13 1000 
not B 1000 100,000 
A not A 
4.52 14.3 
B 1 0 
not B 0 10,000 
A not A 
B 10 0 
not B 0 100,000
Collection of Documents: Insert Meta-Data 
© 2014 MapR Technologies 38 
Search 
Technology 
Item 
meta-data 
Ingest easily via NFS 
Document for 
“puppy” id: t4 
title: puppy 
desc: The sweetest little puppy 
ever. 
keywords: puppy, dog, pet
From Indicator Matrix to New Indicator Field 
© 2014 MapR Technologies 39 
✔ 
id: t4 
title: puppy 
desc: The sweetest little puppy 
ever. 
keywords: puppy, dog, pet 
indicators: (t1) 
Solr document 
for “puppy” 
Note: data for the indicator field is added directly to meta-data for a document in Apache 
Solr or Elastic Search index. You don’t need to create a separate index for the indicators.
Going Further: Multi-Modal Recommendation 
© 2014 MapR Technologies 40
Going Further: Multi-Modal Recommendation 
© 2014 MapR Technologies 41
© 2014 MapR Technologies 42 
For example 
• Users enter queries (A) 
– (actor = user, item=query) 
• Users view videos (B) 
– (actor = user, item=video) 
• ATA gives query recommendation 
– “did you mean to ask for” 
• BTB gives video recommendation 
– “you might like these videos”
© 2014 MapR Technologies 43 
The punch-line 
• BTA recommends videos in response to a query 
– (isn’t that a search engine?) 
– (not quite, it doesn’t look at content or meta-data)
© 2014 MapR Technologies 44 
Real-life example 
• Query: “Paco de Lucia” 
• Conventional meta-data search results: 
– “hombres de paco” times 400 
– not much else 
• Recommendation based search: 
– Flamenco guitar and dancers 
– Spanish and classical guitar 
– Van Halen doing a classical/flamenco riff
© 2014 MapR Technologies 45 
Real-life example
© 2014 MapR Technologies 46 
Hypothetical Example 
• Want a navigational ontology? 
• Just put labels on a web page with traffic 
– This gives A = users x label clicks 
• Remember viewing history 
– This gives B = users x items 
• Cross recommend 
– B’A = label to item mapping 
• After several users click, results are whatever users think they 
should be
available for free at 
http://www.mapr.com/practical-machine-learning 
© 2014 MapR Technologies 47 
More Details Available 
available for free at 
http://www.mapr.com/practical-machine-learning
© 2014 MapR Technologies 48 
Who I am 
Ted Dunning, Chief Applications Architect, MapR Technologies 
Email tdunning@mapr.com tdunning@apache.org 
Twitter @Ted_Dunning 
Apache Mahout https://mahout.apache.org/ 
Twitter @ApacheMahout
© 2014 MapR Technologies 49 
Q & A 
Engage with us! 
@mapr maprtech 
jbates@mapr.com 
MapR 
maprtech 
mapr-technologies

Cognitive computing with big data, high tech and low tech approaches

  • 1.
    © 2014 MapRTechno©lo 2g0ie1s4 MapR Technologies 1
  • 2.
    © 2014 MapRTechno©lo 2g0ie1s4 MapR Technologies 2
  • 3.
    © 2014 MapRTechnologies 3 Who I am Ted Dunning, Chief Applications Architect, MapR Technologies Email tdunning@mapr.com tdunning@apache.org Twitter @Ted_Dunning Apache Mahout https://mahout.apache.org/ Twitter @ApacheMahout
  • 4.
    © 2014 MapRTechnologies 4 The outline • The first open source, big data project • Another big data project • Conclusions
  • 5.
    © 2014 MapRTechnologies 5 First: An apology for going off-script
  • 6.
    © 2014 MapRTechnologies 6 Now, the story
  • 7.
    © 2014 MapRTechnologies 7
  • 8.
    In 1866, thetop finishers in the tea race reached London in 99 days, within 2 hours of each other © 2014 MapR Technologies 8
  • 9.
    © 2014 MapRTechnologies 9
  • 10.
    But in 1851,the record had been set at 89 days by the Flying Cloud © 2014 MapR Technologies 10
  • 11.
    © 2014 MapRTechnologies 11 The difference was due (in part) to big data
  • 12.
    © 2014 MapRTechnologies 12
  • 13.
    © 2014 MapRTechnologies 13
  • 14.
    © 2014 MapRTechnologies 14
  • 15.
    These charts werefree … If you donated your data © 2014 MapR Technologies 15
  • 16.
    But how doesthis apply today? © 2014 MapR Technologies 16
  • 17.
    © 2014 MapRTechnologies 17 Key Points of Maury’s Work • Give to get – Give the Abstract Log to captains, get data • Data consortium wins – Merging data gives pictures nobody else can see • Give back – Them that gives, also gets • But this is just what every data driven web site does! – Just 150 years before everybody else
  • 18.
    © 2014 MapRTechnologies 18 The Real News in Behavioral Analysis • Everybody knows that: • You need ensembles of many models to do recommendations • You need to use factorization models • You predict what you observe • (You should predict ratings)
  • 19.
    © 2014 MapRTechnologies 19 But … none of this is really true
  • 20.
    © 2014 MapRTechnologies 20 In fact, • Fancy models are rarely useful expenditures of time • Factorization can be good, but not much better (if at all) • Ratings are disastrously bad data • Cross-recommendation and multi-modal recommendations are much more interesting – Multiple kinds of input are far better than multiple models • The UI has a far larger impact than the models • The best algorithms combine simplicity with accuracy – So simple you can embed them in a search engine
  • 21.
    © 2014 MapRTechnologies 21 Here’s how
  • 22.
    © 2014 MapRTechnologies 22 CooccurrenCcoeoAcncaulryrseisn ce Analysis
  • 23.
    © 2014 MapRTechnologies 23 How Often Do Items Co-occur How often do items co-occur?
  • 24.
    © 2014 MapRTechnologies 24 Which Co-occurrences are Interesting? Which cooccurences are interesting? Each row of indicators becomes a field in a search engine document
  • 25.
    © 2014 MapRTechnologies 25 Recommendations Alice got an apple and Alice a puppy
  • 26.
    © 2014 MapRTechnologies 26 Recommendations Alice got an apple and Alice a puppy Charles Charles got a bicycle
  • 27.
    © 2014 MapRTechnologies 27 Recommendations Alice got an apple and Alice a puppy Bob Bob got an apple Charles Charles got a bicycle
  • 28.
    © 2014 MapRTechnologies 28 Recommendations Alice got an apple and Alice a puppy Bob What else would Bob like? Charles Charles got a bicycle
  • 29.
    © 2014 MapRTechnologies 29 Recommendations Alice got an apple and Alice a puppy Bob A puppy! Charles Charles got a bicycle
  • 30.
    © 2014 MapRTechnologies 30 You get the idea of how recommenders can work…
  • 31.
    By the way,like me, Bob also wants a pony… © 2014 MapR Technologies 31
  • 32.
    © 2014 MapRTechnologies 32 Recommendations ? Alice Bob Amelia Charles What if everybody gets a pony? What else would you recommend for new user Amelia?
  • 33.
    © 2014 MapRTechnologies 33 Recommendations ? Alice Bob Amelia Charles If everybody gets a pony, it’s not a very good indicator of what to else predict...
  • 34.
    © 2014 MapRTechnologies 34 Problems with Raw Co-occurrence • Very popular items co-occur with everything or why it’s not very helpful to know that everybody wants a pony… – Examples: Welcome document; Elevator music • Very widespread occurrence is not interesting to generate indicators for recommendation – Unless you want to offer an item that is constantly desired, such as razor blades (or ponies) • What we want is anomalous co-occurrence – This is the source of interesting indicators of preference on which to base recommendation
  • 35.
    Overview: Get UsefulIndicators from Behaviors © 2014 MapR Technologies 35 1. Use log files to build history matrix of users x items – Remember: this history of interactions will be sparse compared to all potential combinations 2. Transform to a co-occurrence matrix of items x items 3. Look for useful indicators by identifying anomalous co-occurrences to make an indicator matrix – Log Likelihood Ratio (LLR) can be helpful to judge which co-occurrences can with confidence be used as indicators of preference – ItemSimilarityJob in Apache Mahout uses LLR
  • 36.
    Which one isthe anomalous co-occurrence? A not A B 1 0 not B 0 2 © 2014 MapR Technologies 36 A not A B 13 1000 not B 1000 100,000 A not A B 1 0 not B 0 10,000 A not A B 10 0 not B 0 100,000
  • 37.
    Which one isthe anomalous co-occurrence? A not A 0.90 1.95 B 1 0 not B 0 2 © 2014 MapR Technologies 37 A not A B 13 1000 not B 1000 100,000 A not A 4.52 14.3 B 1 0 not B 0 10,000 A not A B 10 0 not B 0 100,000
  • 38.
    Collection of Documents:Insert Meta-Data © 2014 MapR Technologies 38 Search Technology Item meta-data Ingest easily via NFS Document for “puppy” id: t4 title: puppy desc: The sweetest little puppy ever. keywords: puppy, dog, pet
  • 39.
    From Indicator Matrixto New Indicator Field © 2014 MapR Technologies 39 ✔ id: t4 title: puppy desc: The sweetest little puppy ever. keywords: puppy, dog, pet indicators: (t1) Solr document for “puppy” Note: data for the indicator field is added directly to meta-data for a document in Apache Solr or Elastic Search index. You don’t need to create a separate index for the indicators.
  • 40.
    Going Further: Multi-ModalRecommendation © 2014 MapR Technologies 40
  • 41.
    Going Further: Multi-ModalRecommendation © 2014 MapR Technologies 41
  • 42.
    © 2014 MapRTechnologies 42 For example • Users enter queries (A) – (actor = user, item=query) • Users view videos (B) – (actor = user, item=video) • ATA gives query recommendation – “did you mean to ask for” • BTB gives video recommendation – “you might like these videos”
  • 43.
    © 2014 MapRTechnologies 43 The punch-line • BTA recommends videos in response to a query – (isn’t that a search engine?) – (not quite, it doesn’t look at content or meta-data)
  • 44.
    © 2014 MapRTechnologies 44 Real-life example • Query: “Paco de Lucia” • Conventional meta-data search results: – “hombres de paco” times 400 – not much else • Recommendation based search: – Flamenco guitar and dancers – Spanish and classical guitar – Van Halen doing a classical/flamenco riff
  • 45.
    © 2014 MapRTechnologies 45 Real-life example
  • 46.
    © 2014 MapRTechnologies 46 Hypothetical Example • Want a navigational ontology? • Just put labels on a web page with traffic – This gives A = users x label clicks • Remember viewing history – This gives B = users x items • Cross recommend – B’A = label to item mapping • After several users click, results are whatever users think they should be
  • 47.
    available for freeat http://www.mapr.com/practical-machine-learning © 2014 MapR Technologies 47 More Details Available available for free at http://www.mapr.com/practical-machine-learning
  • 48.
    © 2014 MapRTechnologies 48 Who I am Ted Dunning, Chief Applications Architect, MapR Technologies Email tdunning@mapr.com tdunning@apache.org Twitter @Ted_Dunning Apache Mahout https://mahout.apache.org/ Twitter @ApacheMahout
  • 49.
    © 2014 MapRTechnologies 49 Q & A Engage with us! @mapr maprtech jbates@mapr.com MapR maprtech mapr-technologies

Editor's Notes

  • #36 Mention that the Pony book said “RowSimilarityJob”…
  • #43 Problem starts here…