Building Search & Recommendation Engines

In this talk, you'll learn how to build your own search and recommendation engine based on the open source Apache Lucene/Solr project. We'll dive into some of the data science behind how search engines work, covering multi-lingual text analysis, natural language processing, relevancy ranking algorithms, knowledge graphs, reflected intelligence, collaborative filtering, and other machine learning techniques used to drive relevant results for free-text queries. We'll also demonstrate how to build a recommendation engine leveraging the same platform and techniques that power search for most of the world's top companies. You'll walk away from this presentation with the toolbox you need to go and implement your very own search-based product using your own data.



  1. Building Search & Recommendation Engines
     Trey Grainger, SVP of Engineering, Lucidworks
     Greenville Data Science, 2017.06.29
  2. About Me: Trey Grainger, SVP of Engineering
     • Previously Director of Engineering @ CareerBuilder
     • MBA, Management of Technology – Georgia Tech
     • BA, Computer Science, Business, & Philosophy – Furman University
     • Information Retrieval & Web Search – Stanford University
     Other fun projects:
     • Co-author of Solr in Action, plus numerous research papers
     • Frequent conference speaker
     • Founder of Celiaccess.com, the gluten-free search engine
     • Lucene/Solr contributor
     • Startup Investor / Advisor
  3. what do you do?
  4. Search-Driven Everything: Customer Service, Customer Insights, Fraud Surveillance, Research Portal, Online Retail, Digital Content
  5. Apache Solr
  6. “Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.”
  7. Key Solr Features (*source: Solr in Action, chapter 2):
     ● Multilingual Keyword search
     ● Relevancy Ranking of results
     ● Faceting & Analytics (nested / relational)
     ● Highlighting
     ● Spelling Correction
     ● Autocomplete/Type-ahead Prediction
     ● Sorting, Grouping, Deduplication
     ● Distributed, Fault-tolerant, Scalable
     ● Geospatial search
     ● Complex Function queries
     ● Recommendations (More Like This)
     ● Graph Queries and Traversals
     ● SQL Query Support
     ● Streaming Aggregations
     ● Batch and Streaming processing
     ● Highly Configurable / Plugins
     ● Learning to Rank
     ● Building machine-learning models
     ● … many more
  8. The standard for enterprise search: 90% of the Fortune 500 uses Solr.
  9. Lucidworks Fusion
  10. All Your Data
  11. • Over 50 connectors to integrate all your data
      • Robust parsing framework to seamlessly ingest all your document types
      • Point-and-click indexing configuration and iterative simulation of results, for full control over your ETL process
      • Your security model enforced end-to-end, from ingest to search, across your different datasources
  12. Experience Management
  13. • Relevancy tuning: Point-and-click query pipeline configuration allows fine-grained control of results.
      • Machine-driven relevancy: Signals aggregation learns and automatically tunes relevancy and drives recommendations out of the box.
      • Powerful pipeline stages: Customize fields, stages, synonyms, boosts, facets, machine learning models, your own scripted behavior, and dozens of other powerful search stages.
      • Turnkey search UI (Lucidworks View): Build a sophisticated end-to-end search application in just hours.
  14. • Seamless integration of your entire search & analytics platform
      • All capabilities exposed through secured APIs, so you can use our UI or build your own
      • End-to-end security policies can be applied out of the box to every aspect of your search ecosystem
      • Distributed, fault-tolerant scaling and supervision of your entire search application
  15. Create custom search and discovery applications in minutes.
      • Modular library of UI components to create prototypes in hours, not weeks.
      • Fine-grained security for government, Fortune 500, military, and law enforcement, enforcing permissions by item, role, geography, or other parameters.
      • Stateless architecture so apps are robust, easy to deploy, and highly scalable.
      • Supports over 25 data platforms including Solr, SharePoint, Elasticsearch, Cloudera, Attivio, FAST, MongoDB, and many more - and of course Lucidworks Fusion.
      • Full library of visualization components for charts, pivots, graphs and more.
      • Pre-tested re-usable modules include pagination, faceting, geospatial mapping, rich snippets, heatmaps, topic pages, and more.
  16. Lucidworks Fusion Architecture (security built-in)
      • Apache Solr shards, with Apache Zookeeper (ZK) providing leader election, load balancing, and shared config management
      • Apache Spark: cluster manager and workers
      • Core services: NLP, recommenders / signals, blob storage, pipelines, scheduling, alerting / messaging, connectors
      • REST API, Admin UI, Twigkit
      • Data sources: logs, files, web, databases, cloud, HDFS (optional)
  17. Fusion powers search for the brightest companies in the world.
  18. Lucidworks Fusion
  19. search & relevancy
  20. Basic Keyword Search: The beginning of a typical search journey
  21. The inverted index. What you SEND to Lucene/Solr:
          Document | Content Field
          doc1     | once upon a time, in a land far, far away
          doc2     | the cow jumped over the moon.
          doc3     | the quick brown fox jumped over the lazy dog.
          doc4     | the cat in the hat
          doc5     | The brown cow said “moo” once.
          ...      | ...
      How the content is INDEXED into Lucene/Solr (conceptually):
          Term  | Documents
          a     | doc1 [2x]
          brown | doc3 [1x], doc5 [1x]
          cat   | doc4 [1x]
          cow   | doc2 [1x], doc5 [1x]
          ...   | ...
          once  | doc1 [1x], doc5 [1x]
          over  | doc2 [1x], doc3 [1x]
          the   | doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
          ...   | ...
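      To make the structure concrete, here is a minimal Python sketch of the conceptual index above (illustrative only; Lucene's actual data structures are far more sophisticated):

      from collections import defaultdict

      docs = {
          "doc1": "once upon a time in a land far far away",
          "doc2": "the cow jumped over the moon",
          "doc3": "the quick brown fox jumped over the lazy dog",
          "doc4": "the cat in the hat",
          "doc5": "the brown cow said moo once",
      }

      # term -> {doc_id: term frequency}, mirroring the "Term | Documents [Nx]" table
      index = defaultdict(lambda: defaultdict(int))
      for doc_id, text in docs.items():
          for term in text.lower().split():
              index[term][doc_id] += 1

      print(dict(index["brown"]))  # {'doc3': 1, 'doc5': 1}
      print(dict(index["the"]))    # {'doc2': 2, 'doc3': 2, 'doc4': 2, 'doc5': 1}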
  22. Matching queries to documents: /solr/select/?q=apache solr
          Field  | Documents
          ...    | ...
          apache | doc1, doc3, doc4, doc5
          hadoop | doc2, doc4, doc6
          solr   | doc1, doc3, doc4, doc7, doc8
          ...    | ...
      “apache” matches doc1, doc3, doc4, and doc5; “solr” matches doc1, doc3, doc4, doc7, and doc8; so doc1, doc3, and doc4 match both terms of the query “apache solr”.
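      Matching is then just set operations over the posting lists. A sketch of an OR query that ranks documents matching more query terms first (not Solr's actual scoring):

      postings = {
          "apache": {"doc1", "doc3", "doc4", "doc5"},
          "solr":   {"doc1", "doc3", "doc4", "doc7", "doc8"},
      }

      matches = {}
      for term in ["apache", "solr"]:          # q=apache solr
          for doc in postings.get(term, set()):
              matches[doc] = matches.get(doc, 0) + 1

      ranked = sorted(matches.items(), key=lambda kv: -kv[1])
      print(ranked)  # doc1, doc3, doc4 match both terms; doc5, doc7, doc8 match one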
  23. Text Analysis: Generating terms to index from raw text
  24. Text Analysis in Solr (*from Solr in Action, Chapter 6)
      A text field in Lucene/Solr has an Analyzer containing:
      ① Zero or more CharFilters – take incoming text and “clean it up” before it is tokenized
      ② One Tokenizer – splits incoming text into a Token Stream containing zero or more Tokens
      ③ Zero or more TokenFilters – examine and optionally modify each Token in the Token Stream
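      The flow is easy to mimic outside of Solr. Here is a toy pure-Python pipeline in the same spirit (in Solr these stages are configured declaratively in the schema, as the analysis-chain slides below show):

      import re

      def char_filter(text):            # CharFilter: "clean up" text before tokenizing
          return re.sub(r"<[^>]+>", " ", text)

      def tokenizer(text):              # Tokenizer: split text into a token stream
          return re.findall(r"\w+", text)

      def lowercase_filter(tokens):     # TokenFilter: examine/modify each token
          return [t.lower() for t in tokens]

      def stop_filter(tokens, stopwords=frozenset({"a", "the", "in"})):
          return [t for t in tokens if t not in stopwords]

      raw = "<b>The Cat</b> in the Hat"
      print(stop_filter(lowercase_filter(tokenizer(char_filter(raw)))))
      # ['cat', 'hat']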
  28. Multi-lingual Text Analysis: Analyzing text across multiple languages
  29. Example English Analysis Chains
      <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.EnglishPossessiveFilterFactory"/>
          <filter class="solr.KeywordMarkerFilterFactory" protected="lang/en_protwords.txt"/>
          <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
      </fieldType>

      <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <charFilter class="solr.HTMLStripCharFilterFactory"/>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="lang/en_synonyms.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                  catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
          <filter class="solr.ASCIIFoldingFilterFactory"/>
          <filter class="solr.KStemFilterFactory"/>
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
      </fieldType>
  30. Per-language Analysis Chains (*some of the 32 different language configurations in Appendix B of Solr in Action)
  32. Which Stemmer do I choose? (*from Solr in Action, Chapter 14)
  33. Common English Stemmers
  35. Relevancy Ranking: Scoring the results, returning the best matches
  36. Classic Lucene/Solr Relevancy Algorithm (*source: Solr in Action, chapter 3):

      Score(q, d) = ∑ over t in q of ( tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) ) · coord(q, d) · queryNorm(q)

      Where:
        t = term; d = document; q = query; f = field
        tf(t in d)   = numTermOccurrencesInDocument^(1/2)
        idf(t)       = 1 + log(numDocs / (docFreq + 1))
        coord(q, d)  = numTermsInDocumentFromQuery / numTermsInQuery
        queryNorm(q) = 1 / sumOfSquaredWeights^(1/2)
        sumOfSquaredWeights = q.getBoost()² · ∑ over t in q of (idf(t) · t.getBoost())²
        norm(t, d)   = d.getBoost() · lengthNorm(f) · f.getBoost()
  38. TF * IDF (*source: Solr in Action, chapter 3)
      • Term Frequency: “How well does a term describe a document?”
        – Measure: how often a term occurs per document
      • Inverse Document Frequency: “How important is a term overall?”
        – Measure: how rare the term is across all documents
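      Plugging the slide's definitions into Python shows why rare terms dominate (the 1,000-document index and the example terms below are hypothetical):

      import math

      def tf(occurrences_in_doc):       # tf(t in d) = sqrt(occurrences)
          return math.sqrt(occurrences_in_doc)

      def idf(num_docs, doc_freq):      # idf(t) = 1 + log(numDocs / (docFreq + 1))
          return 1 + math.log(num_docs / (doc_freq + 1))

      # Same in-document frequency, very different weight:
      print(tf(2) * idf(1000, 5) ** 2)    # rare term ("solr", 5 docs)    ~ 52.9
      print(tf(2) * idf(1000, 950) ** 2)  # common term ("the", 950 docs) ~ 1.6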
  39. BM25 (Okapi “Best Match”, 25th iteration)

      Score(q, d) = ∑ over t in q of idf(t) · ( tf(t in d) · (k + 1) ) / ( tf(t in d) + k · (1 – b + b · |d| / avgdl) )

      Where:
        t = term; d = document; q = query; i = index
        tf(t in d) = numTermOccurrencesInDocument^(1/2)
        idf(t)     = 1 + log(numDocs / (docFreq + 1))
        |d|        = ∑ over t in d of 1  (document length)
        avgdl      = ( ∑ over d in i of |d| ) / ( ∑ over d in i of 1 )  (average document length)
        k = free parameter, usually ~1.2 to 2.0; raises the term-frequency saturation point
        b = free parameter, usually ~0.75; increases the impact of document-length normalization
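      A sketch that plugs numbers into the formula above (with the usual defaults k=1.2, b=0.75) makes the term-frequency saturation visible:

      import math

      def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k=1.2, b=0.75):
          idf = 1 + math.log(num_docs / (doc_freq + 1))
          length_norm = 1 - b + b * (doc_len / avg_doc_len)
          return idf * (tf * (k + 1)) / (tf + k * length_norm)

      # Unlike raw tf-idf, repeated occurrences quickly stop helping:
      for tf in (1, 2, 10):
          print(tf, round(bm25_term_score(tf, doc_len=100, avg_doc_len=100,
                                          num_docs=1000, doc_freq=5), 2))
      # 1 6.12 / 2 8.41 / 10 12.01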
  40. That's great, but what about domain-specific knowledge?
      • News search: popularity and freshness drive relevance
      • Restaurant search: geographical proximity and price range are critical
      • Ecommerce: likelihood of a purchase is key
      • Movie search: more popular titles are generally more relevant
      • Job search: category of job, salary range, and geographical proximity matter
      TF * IDF of keywords can't hold its own against good domain-specific relevance factors!
  41. Domain-specific relevancy calculation (News Website Example; *from chapter 16 of Solr in Action)
      News website, four factors each weighted 25%:
          /select?
            fq=$myQuery&
            q=_query_:"{!func}scale(query($myQuery),0,100)" AND
              _query_:"{!func}div(100,map(geodist(),0,1,1))" AND
              _query_:"{!func}recip(rord(publicationDate),0,100,100)" AND
              _query_:"{!func}scale(popularity,0,100)"&
            myQuery="street festival"&
            sfield=location&
            pt=33.748,-84.391
  42. Fancy boosting functions (Restaurant Search Example)
      Distance (50%) + keywords (30%) + category (20%):
          q=_val_:"scale(mul(query($keywords),1),0,30)" AND
            _val_:"scale(sum($radiusInKm,mul(query($distance),-1)),0,50)" AND
            _val_:"scale(mul(query($category),1),0,20)"
          &keywords=filet mignon
          &radiusInKm=48.28
          &distance=_val_:"geodist(latitudelongitude.latlon_is,33.77402,-84.29659)"
          &category="fine dining"
          &fq={!cache=false v=$keywords}
  43. This is powerful, but feels like a lot of work to get right…
  44. what is “reflected intelligence”?
  45. Reflected Intelligence: “Leveraging previous data and interactions to improve how new data and interactions should be interpreted”
      The Three C's:
      • Content: keywords and other features in your documents
      • Collaboration: how others have chosen to interact with your system
      • Context: available information about your users and their intent
  46. Examples of Reflected Intelligence
      ● Recommendation Algorithms
      ● Building user profiles from past searches, clicks, and other actions
      ● Identifying correlations between keywords/phrases
      ● Building out automatically-generated ontologies from content and queries
      ● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs
      ● Learning to Rank: using relevancy judgements and machine learning to train a relevance model
      ● Discovering misspellings, synonyms, acronyms, and related keywords
      ● Disambiguation of keyword phrases with multiple meanings
      ● Learning what's important in your content
  47. Consider what you know about users (*example from chapter 16 of Solr in Action):
      • John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development.
      • Irene is a bartender in Dublin and is only interested in jobs within 10KM of her location in the food service industry.
      • Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job.
      • Jane is a nurse educator in Boston seeking between $40K and $60K.
  48. Query for Jane, the nurse educator in Boston seeking between $40K and $60K (*example from chapter 16 of Solr in Action):
          http://localhost:8983/solr/jobs/select/?
            fl=jobtitle,city,state,salary&
            q=(
               jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10
              ) AND (
               (city:"Boston" AND state:"MA")^15 OR state:"MA"
              ) AND _val_:"map(salary, 40000, 60000, 10, 0)"
  49. Search Results for Jane (*example documents available @ http://github.com/treygrainger/solr-in-action):
          { ...
            "response":{"numFound":22,"start":0,"docs":[
              {"jobtitle":"Clinical Educator (New England/Boston)",
               "city":"Boston", "state":"MA", "salary":41503},
              {"jobtitle":"Nurse Educator",
               "city":"Braintree", "state":"MA", "salary":56183},
              {"jobtitle":"Nurse Educator",
               "city":"Brighton", "state":"MA", "salary":71359},
            …]}}
  50. You just built a recommendation engine!
  51. Collaborative Filtering. What you SEND to Lucene/Solr:
          Term  | Documents
          user1 | doc1, doc5
          user2 | doc2
          user3 | doc2
          user4 | doc1, doc3, doc4, doc5
          user5 | doc1, doc4
          ...   | ...
      How the content is INDEXED into Lucene/Solr (conceptually):
          Document | “Users who bought this product” field
          doc1     | user1, user4, user5
          doc2     | user2, user3
          doc3     | user4
          doc4     | user4, user5
          doc5     | user4, user1
          ...      | ...
  52. Step 1: Find similar users who like the same documents (*source: Solr in Action, chapter 16)
          q=documentid:("doc1" OR "doc4")
          Document | “Users who bought this product” field
          doc1     | user1, user4, user5
          doc2     | user2, user3
          doc3     | user4
          doc4     | user4, user5
          doc5     | user4, user1
          ...      | ...
      Top-scoring results (most similar users):
      1) user4 (2 shared likes)
      2) user5 (2 shared likes)
      3) user1 (1 shared like)
  53. Step 2: Search for docs “liked” by those similar users (*source: Solr in Action, chapter 16)
          /solr/select/?q=userlikes:("user4"^2 OR "user5"^2 OR "user1"^1)
      Most similar users: 1) user4 (2 shared likes), 2) user5 (2 shared likes), 3) user1 (1 shared like)
      Top recommended documents:
      1) doc1 (matches user4, user5, user1)
      2) doc4 (matches user4, user5)
      3) doc5 (matches user4, user1)
      4) doc3 (matches user4)
      (doc2 does not match)
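      Here is the same two-step logic as plain Python over the toy data (a sketch of what the two Solr queries do; unlike the slide's results, this version also filters out docs the user already likes):

      from collections import Counter

      likes = {"user1": {"doc1", "doc5"}, "user2": {"doc2"}, "user3": {"doc2"},
               "user4": {"doc1", "doc3", "doc4", "doc5"}, "user5": {"doc1", "doc4"}}
      target_docs = {"doc1", "doc4"}            # the current user's likes

      # Step 1: score users by how many target docs they share
      similar = {u: len(d & target_docs) for u, d in likes.items() if d & target_docs}

      # Step 2: weight each new candidate doc by its similar users' scores
      recs = Counter()
      for user, overlap in similar.items():
          for doc in likes[user] - target_docs:
              recs[doc] += overlap

      print(similar)             # {'user1': 1, 'user4': 2, 'user5': 2}
      print(recs.most_common())  # [('doc5', 3), ('doc3', 2)]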
  54. Using matrix factorization is typically more efficient (ships with Fusion 3.1).
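      For intuition, a minimal matrix-factorization sketch using plain SGD on a toy ratings matrix (toy-scale only; Fusion's shipped implementation, or a library such as Spark MLlib's ALS, is what you would use in practice):

      import random

      random.seed(42)
      ratings = [("user1", "doc1", 5), ("user1", "doc5", 4), ("user4", "doc1", 5),
                 ("user4", "doc3", 3), ("user4", "doc4", 4), ("user5", "doc4", 5)]
      k, lr, reg = 2, 0.05, 0.02

      # Random low-rank factor vectors for each user and item
      P = {u: [random.gauss(0, 0.1) for _ in range(k)] for u, _, _ in ratings}
      Q = {i: [random.gauss(0, 0.1) for _ in range(k)] for _, i, _ in ratings}

      for _ in range(500):                      # SGD on squared reconstruction error
          for u, i, r in ratings:
              err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
              for f in range(k):
                  pu, qi = P[u][f], Q[i][f]
                  P[u][f] += lr * (err * qi - reg * pu)
                  Q[i][f] += lr * (err * pu - reg * qi)

      # Predicted affinity for a pairing never observed in the data:
      print(sum(p * q for p, q in zip(P["user5"], Q["doc1"])))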
  55. Feedback Loops: the user searches → the user sees results → the user takes an action → users' actions inform system improvements.
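      Closing the loop can be as simple as aggregating click signals into per-query boosts (a toy sketch with made-up log rows; Fusion's signals aggregation jobs do this at scale):

      from collections import Counter, defaultdict

      click_log = [("ipod", "doc7"), ("ipod", "doc7"), ("ipod", "doc2"),
                   ("solr book", "doc9"), ("ipod", "doc7")]

      boosts = defaultdict(Counter)
      for query, clicked_doc in click_log:
          boosts[query][clicked_doc] += 1

      # Feed back into ranking as boosts on future "ipod" searches:
      print(boosts["ipod"].most_common())   # [('doc7', 3), ('doc2', 1)]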
  56. Demo: Signals & Recommendations
  57. • 200%+ increase in click-through rates
      • 91% lower TCO
      • 50,000 fewer support tickets
      • Increased customer satisfaction
  58. Learning to Rank
  59. Learning to Rank (LTR)
      ● Applies machine learning techniques to discover the combination of features that provides the best ranking
      ● Requires a labeled set of documents with relevancy scores for a given set of queries
      ● Features used for ranking are usually more computationally expensive than the ones used for matching
      ● Typically re-ranks a subset of the matched documents (e.g. the top 1,000)
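      As a simplified illustration, here is a pointwise stand-in for LTR (production systems usually use pairwise/listwise methods such as LambdaMART, covered below; the features and judgments here are hypothetical):

      from sklearn.ensemble import GradientBoostingRegressor

      # One row per (query, document) pair: [bm25_score, title_match, doc_age_days]
      X = [[12.3, 1, 10], [8.1, 0, 3], [15.0, 1, 400],
           [2.2, 0, 50], [9.5, 1, 7], [4.0, 0, 200]]
      y = [3, 1, 2, 0, 3, 0]     # human relevancy judgements: 0 (bad) to 3 (perfect)

      model = GradientBoostingRegressor(n_estimators=50).fit(X, y)

      # Re-rank only the top matches, since these features are expensive to compute:
      candidates = {"docA": [11.0, 1, 5], "docB": [13.0, 0, 300]}
      reranked = sorted(candidates,
                        key=lambda d: -model.predict([candidates[d]])[0])
      print(reranked)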
  61. Common LTR Algorithms
      • RankNet* (neural networks, boosted trees)
      • LambdaMart* (regression trees)
      • SVM Rank** (SVM classifier)
      * http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf
      ** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf
  62. LambdaMart Example (source: T. Grainger, K. AlJadda. "Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016)
  63. Demo: Learning to Rank
  64. #1: Pull, Build, Start Solr
          git clone https://github.com/apache/lucene-solr.git && cd lucene-solr/solr
          ant server
          bin/solr -e techproducts -Dsolr.ltr.enabled=true
      #2: Run Searches
          http://localhost:8983/solr/techproducts/browse?q=ipod
      #3: Supply User Relevancy Judgements
          cd contrib/ltr/example/
          nano user_queries.txt
      #4: Install Training Library
          curl -L https://github.com/cjlin1/liblinear/archive/v210.zip > liblinear-2.1.tar.gz
          tar -xf liblinear-2.1.tar.gz && mv liblinear-210 liblinear
          cd liblinear && make && cd ../
      #5: Train and Upload Model
          ./train_and_upload_demo_model.py -c config.json
      #6: Re-run Searches using Machine-learned Ranking Model
          http://localhost:8983/solr/techproducts/browse?q=ipod
            &rq={!ltr model=exampleModel reRankDocs=25 efi.user_query=$q}
  65. # Run Searches
          http://localhost:8983/solr/techproducts/select?q=ipod
  66. # Supply User Relevancy Judgements
          nano contrib/ltr/example/user_queries.txt
          # Format: query | doc id | relevancy judgement | source
      # Train and Upload Model
          ./train_and_upload_demo_model.py -c config.json
  67. # Re-run Searches using Machine-learned Ranking Model
          http://localhost:8984/solr/techproducts/browse?q=ipod
            &rq={!ltr model=exampleModel reRankDocs=100 efi.user_query=$q}
  68. The Relevancy Spectrum: Traditional Keyword Search, Recommendations, Semantic Search, User Intent, Personalized Search, Augmented Search, Domain-aware Matching
  69. Streaming Expressions & Graph Traversals
  70. Streaming Expressions
      • Perform relational operations on streams
      • Stream sources: search, jdbc, facets, features, gatherNodes, shortestPath, train, model, random, stats, topic
      • Stream decorators: classify, commit, complement, daemon, executor, fetch, having, leftOuterJoin, hashJoin, innerJoin, intersect, merge, null, outerHashJoin, parallel, priority, reduce, rollup, scoreNodes, select, sort, top, unique, update
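      Decorators wrap sources. For example, a unique() over a search() stream might look like the following (the syntax mirrors the collaborative-filtering expression later in this deck; the movielens collection, field name, and port are assumptions borrowed from those examples):

      curl --data-urlencode 'expr=unique(
              search(movielens, q="*:*", fl="user_id_i", sort="user_id_i asc", qt="/export"),
              over="user_id_i")' \
           http://localhost:8983/solr/movielens/stream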
  71. Streaming Expressions examples: shortest-path graph traversal, parallel batch processing, training a logistic regression model, distributed joins, rapid export of all search results, pulling results from an external database, classifying search results.
      Sources: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
               http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
  72. Graph Use Cases
      • Anomaly detection / fraud detection
      • Recommenders
      • Social network analysis
      • Graph Search
      • Access Control
      • Relationship discovery / scoring
      Examples:
      o Find all draft blog posts about “Parallel SQL” written by a developer
      o Find all tweets mentioning “Solr” by me or people I follow
      o Find 3-star hotels in NYC my friends stayed in last year
  73. Solr Graph Timeline
      • Some data is much more naturally represented as a graph structure
      • Solr 6.0: Introduced the Graph Query Parser
      • Solr 6.1: Introduced Graph Streaming Expressions
      • …
      • Solr 6.6: Current version
      • TBD: Semantic Knowledge Graph (patch available)
  74. Graph Query Parser
      • Query-time, cycle-aware graph traversal, able to rank documents based on relationships
      • Provides controls for depth, filtering of results, and inclusion of the root and/or leaves
      • Limitations: single node/shard only
      Examples:
      • http://localhost:8983/solr/graph/query?fl=id,score&q={!graph from=in_edge to=out_edge}id:A
      • http://localhost:8983/solr/my_graph/query?fl=id&q={!graph from=in_edge to=out_edge traversalFilter='foo:[* TO 15]'}id:A
      • http://localhost:8983/solr/my_graph/query?fl=id&q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10]
  75. Graph Streaming Expressions
      • Part of Solr's broader Streaming Expressions capability
      • Implements a powerful, breadth-first traversal
      • Works across shards AND collections
      • Supports aggregations
      • Cycle aware
          curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'expr=…' "http://localhost:18984/solr/movielens/stream"
  76. All movies that user 389 watched:
          expr:gatherNodes(movielens, walk="389->user_id_i", gather="movie_id_i")
  77. All movies that viewers of a specific movie watched (Movie 161: “The Air Up There”):
          expr:gatherNodes(movielens,
                 gatherNodes(movielens, walk="161->movie_id_i", gather="user_id_i"),
                 walk="node->user_id_i", gather="movie_id_i",
                 trackTraversal="true")
  78. Collaborative Filtering
          expr=top(n="5", sort="count(*) desc",
                gatherNodes(movielens,
                  top(n="30", sort="count(*) desc",
                    gatherNodes(movielens,
                      search(movielens, q="user_id_i:305", fl="movie_id_i",
                             sort="movie_id_i asc", qt="/export"),
                      walk="movie_id_i->movie_id_i",
                      gather="user_id_i",
                      maxDocFreq="10000",
                      count(*))),
                  walk="node->user_id_i",
                  gather="movie_id_i",
                  count(*)))
  79. Comparing Graph Choices
      • Solr: Best use case: QParser (predefined relationships as filters) and Expressions (fast, query-based, distributed graph ops). Common graph algorithms (e.g. Pregel, traversal): partial. Scaling: QParser co-located shards only; Expressions yes. Commercial license required: no. Visualizations: GraphML support (e.g. Gephi).
      • Elastic Graph: Best use case: limited to sequential, term-relatedness exploration only. Common graph algorithms: no. Scaling: yes. Commercial license required: yes. Visualizations: Kibana.
      • Neo4J: Best use case: graph ops and querying that fit on a single node. Common graph algorithms: yes. Scaling: Master/Replica. Commercial license required: GPLv3. Visualizations: Neo4j browser.
      • Spark GraphX: Best use case: large-scale, iterative graph ops. Common graph algorithms: yes. Scaling: yes. Commercial license required: no. Visualizations: 3rd party.
  80. Data-driven App Sophistication, from basic to self-learning:
      1) Basic Keyword Search (inverted index, tf-idf, bm25, query formulation, etc.)
      2) Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)
      3) Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification)
      4) Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks)
      5) Self-learning
  81. Additional References
  83. Contact Info
      Trey Grainger
      trey.grainger@lucidworks.com
      @treygrainger
      http://solrinaction.com
      Meetup discount (39% off): 39grainger
      Other presentations: http://www.treygrainger.com
  84. Audience Question #1: How can you figure out the meaning or intent of keywords, particularly when there are multiple ways to represent them or multiple meanings?
  85. How do we handle phrases with ambiguous meanings?
      Example related keywords (representing multiple meanings):
          driver    → truck driver, linux, windows, courier, embedded, cdl, delivery
          architect → autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer
      Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  86. A few methodologies:
      1) Query Log Mining
      2) Semantic Knowledge Graph
  87. Query Log Mining: Discovering ambiguous phrases
      1) Classify users who ran each search in the search logs (i.e. by the job title classifications of the jobs to which they applied).
      2) Create a probabilistic graphical model of those classifications mapped to each keyword phrase.
      3) Segment the search term => related search terms list by classification, to return a separate related-terms list per classification.
      Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
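      A sketch of steps 1 through 3 (the log rows and classifications below are made up for illustration):

      from collections import Counter, defaultdict

      # (ambiguous query, searcher's job classification, related term from their session)
      log = [("driver", "transportation", "cdl"),
             ("driver", "transportation", "truck driver"),
             ("driver", "software",       "linux"),
             ("driver", "software",       "embedded"),
             ("driver", "transportation", "delivery")]

      by_class = defaultdict(Counter)
      for query, user_class, related_term in log:    # steps 1 and 2
          by_class[user_class][related_term] += 1

      for cls, counts in by_class.items():           # step 3: one list per meaning
          total = sum(counts.values())
          print(cls, {t: round(c / total, 2) for t, c in counts.items()})
      # transportation {'cdl': 0.33, 'truck driver': 0.33, 'delivery': 0.33}
      # software {'linux': 0.5, 'embedded': 0.5}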
  88. Semantic Knowledge Graph: Discovering ambiguous phrases
      1) Exact same concept, but use a document classification field (i.e. category) as the first level of your graph, and the related terms as the second level to which you traverse.
      2) Has the benefit that you don't need query logs to mine, but it will be representative of your data as opposed to your users' intent, so the quality depends on how clean and representative your documents are.
  89. Disambiguated meanings (represented as term vectors)
      Example related keywords (disambiguated meanings):
      architect
        1: enterprise architect, java architect, data architect, oracle, java, .net
        2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer
      driver
        1: linux, windows, embedded
        2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
      designer
        1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video
        2: graphic, web designer, design, web design, graphic design, graphic designer
        3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit
      Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  90. Using the disambiguated meanings
      In a situation where a user searches for an ambiguous phrase, what information can we use to pick the correct underlying meaning?
      1. Any pre-existing knowledge about the user:
         • User is a software engineer
         • User has previously run searches for “c++” and “linux”
      2. Context within the query: user searched for "windows AND driver" vs. "courier OR driver"
      3. If all else fails (and there is no context), use the most commonly occurring meaning.
         driver  1: linux, windows, embedded
                 2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
      Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  91. Audience Question #2: Can you tell me more about the semantic knowledge graph?
      See: http://www.treygrainger.com/posts/presentations/the-semantic-knowledge-graph/
