Anyone Can Build A Recommendation Engine With Solr: Presented by Doug Turnbull, OpenSource Connections

Presented at Lucene/Solr Revolution 2016

Published in: Technology

  1. OCTOBER 11-14, 2016 • BOSTON, MA
  2. Anyone can build a Recsys w/ Solr! Doug Turnbull, Relevance Consultant, OpenSource Connections
  3. I’m now available in book form! https://www.manning.com/books/relevant-search Discount code: relsearch (38% off) http://opensourceconnections.com/about-us/doug-turnbull/ Me The company...
  4. How do search engines work? The answer can be found in your textbook…
     Book Index:
     •  Topics -> page no
     •  Very efficient tool – compare to scanning the whole book!
     Lucene uses an index:
     •  Tokens => document ids:
        laser => [2, 4]
        light => [2, 5]
        lightsaber => [0, 1, 5, 7]
     Index diagram (field Body): term laser: doc 2 <metadata>, doc 4 <metadata>; term light: doc 2 <metadata>; term lightsaber: doc 0
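Slide 4's token-to-document-ids mapping can be sketched in a few lines of Python. This is a toy illustration, not how Lucene stores its index; the corpus below is hypothetical and was chosen only so that the postings match the ones on the slide.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

# Hypothetical document bodies (assumption: not from the talk).
docs = {
    0: "lightsaber",
    1: "lightsaber",
    2: "laser light",
    4: "laser",
    5: "light lightsaber",
    7: "lightsaber",
}

index = build_inverted_index(docs)
for token in ("laser", "light", "lightsaber"):
    print(token, "=>", index[token])
```

Looking up a token is then a single dictionary access, which is the "compare to scanning the whole book" point.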
  5. What's the point?
     Solr:
     -  A general purpose system for looking up content based on the features that describe it
     Tokens aren't really words!
     doc0: "I like the bananas" -> Analysis -> [I] [lik] [banan]
     Search: "liked banana?" -> Analysis -> [lik] [banan]
     Index: term I: doc 0; term lik: doc 0; term banan: doc 0
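A rough sketch of what "analysis" on slide 5 does to text, assuming a made-up stopword list and deliberately crude suffix stripping. Solr's real analyzers and stemmers are configurable and far more careful; the names `STOPWORDS`, `SUFFIXES` and `analyze` are hypothetical, and this sketch also lowercases, so the slide's [I] comes out as "i".

```python
import re

STOPWORDS = {"the"}                     # hypothetical stopword list
SUFFIXES = ("ed", "as", "e", "a", "s")  # hypothetical, crude suffix rules

def analyze(text):
    """Lowercase, split on non-letters, drop stopwords, strip one suffix."""
    terms = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in STOPWORDS:
            continue
        for suffix in SUFFIXES:
            # Only strip when enough of the word is left over.
            if len(token) > len(suffix) + 2 and token.endswith(suffix):
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms

indexed = analyze("I like the bananas")
searched = analyze("liked banana?")
print(indexed, searched)
```

The document and the query never share a literal word, yet both reduce to the same terms, which is why the search matches.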
  6. TF*IDF -- measuring feature weight
     term I: doc 0: freq: 5; doc 1: freq: 7; doc 3: freq: 4
     term banan: doc 0: freq: 2
     "I-ness" is not special: doc0: tf==5 df==3, (raw) TF*IDF = 5/3 = 1.6667
     "Banana-ness" is pretty special: doc0: tf==2 df==1, (raw) TF*IDF = 2/1 = 2.0
     Search is really distributed feature matching and similarity (text-oriented)
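The raw TF*IDF arithmetic on slide 6 (term frequency divided by document frequency) can be reproduced directly from the postings shown there. This is only the slide's simplified ratio, not the scoring formula Lucene actually uses; the function name `raw_tf_idf` is an assumption.

```python
# Postings from slide 6: term -> {doc id: term frequency}
postings = {
    "I": {0: 5, 1: 7, 3: 4},
    "banan": {0: 2},
}

def raw_tf_idf(term, doc_id):
    """Slide-style raw score: tf in this doc / number of docs with the term."""
    docs_with_term = postings[term]
    tf = docs_with_term[doc_id]
    df = len(docs_with_term)
    return tf / df

print(round(raw_tf_idf("I", 0), 4))      # 5/3
print(round(raw_tf_idf("banan", 0), 4))  # 2/1
```

Even though "I" occurs more often in doc 0, the rarer term "banan" ends up with the higher weight.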
  7. Search often stands in for human interactions
     "I have a craving for a nice juicy cut of meat. What might you recommend?"
     "I have JUST the thing!"
  8. Searching the market
     q=(juiciness:juicy meatiness:meaty)
  9. Modeling arbitrary feature strength
     term juicy: steak: juiciness: 5; grapefruit: juiciness: 7; orange: juiciness: 4
     term meaty: burger: meatiness: 2
     What you want:
     { item: "steak", juiciness: ["juicy", "juicy", "juicy"], meatiness: ["meaty"] }
     Use term frequency as feature strength:
     { item: "grapefruit", juiciness: ["juicy", "juicy", "juicy", "juicy", "juicy"], meatiness: [""] }
     (remember, Solr doesn't need to store this)
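One way to produce documents like the ones on slide 9 is to repeat a token once per unit of feature strength, so that term frequency carries the strength. A minimal sketch; the helper name `encode_features` and its input shape are assumptions, and a strength of zero yields an empty list here rather than the slide's [""].

```python
def encode_features(item, strengths):
    """Build a document where each feature token is repeated `strength` times.

    `strengths` maps field name -> (token, integer strength).
    """
    doc = {"item": item}
    for field, (token, strength) in strengths.items():
        doc[field] = [token] * strength
    return doc

steak = encode_features(
    "steak",
    {"juiciness": ("juicy", 3), "meatiness": ("meaty", 1)},
)
grapefruit = encode_features(
    "grapefruit",
    {"juiciness": ("juicy", 5), "meatiness": ("meaty", 0)},
)
print(steak)
print(grapefruit)
```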
  10. TF*IDF -- measuring feature weight
      term juicy: doc 0: freq: 5; doc 1: freq: 7; doc 3: freq: 4
      term meaty: doc 0: freq: 2
      "juicy-ness" is pretty non-special: doc0: tf==5 df==3, (raw) TF*IDF = 5/3 = 1.6667
      "meaty-ness" is pretty special: doc0: tf==2 df==1, (raw) TF*IDF = 2/1 = 2.0
      Search is really distributed feature matching and similarity
  11. Requesting something from my grocer
      Axes: More juicy / Less juicy; More meaty / Less meaty
      q=meaty juicy
      Results: 1. 2. 3.
  12. Recsys also stands in for human interactions
      "Hi Jane, recommend me something?"
      "Hmm… <Tom likes limes, what is similar to limes?>"
  13. recommendations
      Use existing properties of thing to recommend similar things: juicy, citrus
      More like this for unstructured data: What features/tokens are most representative of this thing?
      http://solr.quepid.com/solr/tmdb/select?q={!mlt%20qf=overview}97&fl=title,id,overview (movies like
      juicy citrus (search)
      Here's some ideas...
      { item: "lime", juiciness: ["juicy", "juicy", "juicy"], citrusness: ["citrus", "citrus", "citrus"], meatiness: [""], partyness: ["party"] }
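The more-like-this URL on slide 13 passes a `{!mlt qf=overview}97` query to `/select`. A small sketch of building that query string with the standard library, so the braces and space are escaped consistently; the helper name `mlt_query_string` is hypothetical, and only the `q` and `fl` parameters seen on the slide are used.

```python
from urllib.parse import urlencode, quote, parse_qs

def mlt_query_string(doc_id, field, return_fields):
    """Build the query string for a more-like-this lookup on one field."""
    params = {
        "q": "{!mlt qf=%s}%s" % (field, doc_id),
        "fl": ",".join(return_fields),
    }
    # quote_via=quote encodes the space as %20, as in the slide's URL.
    return urlencode(params, quote_via=quote)

query_string = mlt_query_string(97, "overview", ["title", "id", "overview"])
print("/solr/tmdb/select?" + query_string)
```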
  14. "Content Based" more-like-these
      Use existing properties of thing to recommend similar things: juicy, meaty, citrus
      http://solr.quepid.com/solr/tmdb/select?q={!mlt%20qf=overview}97&fl=title,id,overview (movies like
      Here's some ideas...
      Jane knows a few more things that Tom likes...
  15. Personalization metadata
      Index extra data alongside your products
      { item: "hamburger", preferred_by_genders: ["m", …], preferred_by_ages: ["30_40"] }
      age:30_40 gender:m
      http://solr.quepid.com/solr/tmdb/select?q={!mlt%20qf=overview}97&fl=title,id,overview (movies like
      Here's some ideas...
      Jane knows a few things about Tom (30 yr old male)
  16. But, Jane's intuition transcends words!
      age:30_40 gender:m
      Currently we're stuck with predefined labels: citrus, juicy, meaty
      We're curating using known vocabularies (can we describe everything?)
  17. What we like often transcends words
      There are emergent properties of our world that don't have names
      Relative flarglewharbliness: More flarglewharbily / Less flarglewharbily (Diet Coke)
  18. What's a flarglewharble?
      More flarglewharbily / Less flarglewharbily
      Table (rows: people; columns: fruit, orange, lemon, banana, mentos, diet coke): tom X; sue X X X; charlie X X; clare X X; hal x x
      Goes together
      Diet Coke
  19. Can search find the flargles?
      q=(flarglewharbliness:very)
      term flarglewharble: diet-coke: flargleness: 4; mentos: flargleness: 3; banana: flargleness: 1
      Can we somehow build?
      Diet Coke
  20. Table (person / food; columns: orange, lemon, banana, mentos, diet coke): tom X X; sue X X X X; charlie X X; clare X X; hal x x X
      Goes together: flarglewharble!
      Babies often use made-up words based on emergent patterns in their universe
      They are less committed to our language
  21. What's the point?
      Collaborative filtering: Latent vocabulary (the flarglewharbles)
      Pure Search, Content-based Recs: Predefined vocabulary
      Can Solr discover the latent/emergent vocabularies?
  22. Can Solr discover the latent/emergent vocabularies?
      Well first let's tell Solr about our users
      { user: "Sue", foods_bought: ["lemon", "banana", "mentos", "diet coke"] }
      { user: "Charlie", foods_bought: ["banana", "mentos", "diet coke"] }
  23. Faceting?
      We need a way to look across users and look for patterns (analyze all the baskets that contain mentos)
      q=foods_bought:mentos&facet=true&facet.field=foods_bought
      facets: mentos: 3, diet-coke: 3, banana: 2
      Hmm:
      -  Bananas are globally popular
      -  Diet-coke is probably what matters
  24. Counts don't work: importance of significance
      q=foods_bought:mentos&facet=true&facet.field=foods_bought
      facets: mentos: 3, diet-coke: 3, banana: 2
      Diet Coke: Global popularity: diet coke (3); Local popularity: 3; Score: 3/3 = 1
      Banana: Global popularity: banana (4); Local popularity: 2; Score: 2/4 = 0.5
      by-significance: diet-coke: 1, banana: 0.5
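The local-over-global score on slide 24 is easy to check by hand in Python. A minimal sketch under stated assumptions: Sue's and Charlie's baskets come from slide 22, while the Hal, Tom and Clare baskets are hypothetical, picked only so the counts match the slide (mentos 3, diet coke 3 of 3, banana 2 of 4). The function name `significance` is also an assumption.

```python
from collections import Counter

baskets = {
    "Sue": ["lemon", "banana", "mentos", "diet coke"],  # slide 22
    "Charlie": ["banana", "mentos", "diet coke"],       # slide 22
    "Hal": ["mentos", "diet coke"],                     # hypothetical
    "Tom": ["orange", "lemon", "banana"],               # hypothetical
    "Clare": ["lemon", "banana"],                       # hypothetical
}

def significance(baskets, seed):
    """Score each co-bought food as local count / global count."""
    global_counts = Counter()
    local_counts = Counter()
    for foods in baskets.values():
        unique = set(foods)
        global_counts.update(unique)
        if seed in unique:
            # "Local" means: only baskets that contain the seed item.
            local_counts.update(unique - {seed})
    scores = {
        food: local / global_counts[food]
        for food, local in local_counts.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

ranked = significance(baskets, "mentos")
scores = dict(ranked)
print(ranked)
```

Plain facet counts would rank banana close behind diet coke; dividing by global popularity pushes the globally common banana down.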
  25. Streaming Expressions
      /select?q=*:*&facet=true&facet.field=liked_movies
      But there's a new sheriff in town!
      One option: we could go about and gather global doc freqs & compare those locally. Terms component another option… plugins...
      Streaming expressions -- distributed stream computation system on top of SolrCloud
      You must ALWAYS cross the streams!
  26. Streaming Expressions
      /stream?expr=scoreNodes(facet(...)...)
      Faceting with Streaming Expressions:
      facet(movielens, q="*:*", buckets="liked_movies", bucketSorts="count(*) desc", bucketSizeLimit="100", count(*))
      Output:
      { "result-set": {"docs":[ { "count(*)":55807, "liked_movies":"318"}, { "count(*)":52352, "liked_movies":"296"}, { "count(*)":50114, "liked_movies":"593"}
      Nodes to be transformed
  27. Significance with streaming expr
      /stream?expr=scoreNodes(facet(...)...)
      scoreNodes(
        select(
          facet(movielens, q="liked_movies:2571 OR liked_movies:4993", buckets="liked_movies", bucketSorts="count(*) desc", bucketSizeLimit="100", count(*)),
          liked_movies as node,
          "count(*)",
          replace(collection, null, withValue=movielens),
          replace(field, null, withValue=liked_movies))
      )
      1.  facet (just like above, just with streaming expr)
      2.  select to format data for scoreNodes
      3.  scoreNodes to score using TF*IDF
      Banana occurs in 2 documents here, 4 globally -- 2/4 = 0.5
      Diet coke occurs in 2 documents here, 2 globally -- 2/2 = 1.0
      "Thinking back on my shoppers' behaviors, here's some other items you might like:" Diet Coke
      (thanks Joel Bernstein!)
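Since the scoreNodes expression on slide 27 is mostly boilerplate around a collection, a field and a query, it can be assembled from those three values. A sketch that only builds the expression text and the `expr` query string for `/stream`; it does not contact Solr, and the helper name `score_nodes_expr` is hypothetical.

```python
from urllib.parse import urlencode, parse_qs

def score_nodes_expr(collection, field, query, limit=100):
    """Compose the facet -> select -> scoreNodes expression from slide 27."""
    return (
        "scoreNodes(select("
        f'facet({collection}, q="{query}", buckets="{field}", '
        f'bucketSorts="count(*) desc", bucketSizeLimit="{limit}", count(*)), '
        f'{field} as node, "count(*)", '
        f"replace(collection, null, withValue={collection}), "
        f"replace(field, null, withValue={field})))"
    )

expr = score_nodes_expr(
    "movielens",
    "liked_movies",
    "liked_movies:2571 OR liked_movies:4993",
)
# URL-encode the expression as the value of the expr parameter.
stream_query = urlencode({"expr": expr})
print("/stream?" + stream_query)
```

Swapping the `query` argument (for example to `juiciness_pref:juicy`) gives the variation shown on the following slide without touching the rest of the expression.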
  28. Lots of power here
      /stream?expr=scoreNodes(facet(...)...)
      scoreNodes(
        select(
          facet(movielens, q="juiciness_pref:juicy", buckets="liked_movies", bucketSorts="count(*) desc", bucketSizeLimit="100", count(*)),
          liked_movies as node,
          "count(*)",
          replace(collection, null, withValue=movielens),
          replace(field, null, withValue=liked_movies))
      )
      Find users that like juicy things, what do they like?
      Perhaps bucket over the aisle they like?
      Construct our query to focus on a date range?
      Many insights (thanks Joel Bernstein!)
  29. Only glimpsing the underlying pattern...
      We're not enumerating the flarglewharbles, and the schlumblefumbles
      More flarglewharbily / Less flarglewharbily (Diet Coke)
      More schlumblewumbly / Less schlumblewumbly (Diet Coke)
  30. Coming soon (Solr 6.3)
      http://yonik.com/solr-6-3/
      https://issues.apache.org/jira/browse/SOLR-9258
      -  Models for training classifiers
      -  Then in turn updating documents
      Progress is being made!
      -  Clustering?
  31. Questions? The Flarglewharbles
