What if your search engine could automatically tune its own domain-specific relevancy model? What if it could learn the important phrases and topics within your domain, automatically identify alternate spellings (synonyms, acronyms, and related phrases) and disambiguate multiple meanings of those phrases, learn the conceptual relationships embedded within your documents, and even use machine-learned ranking to discover the relative importance of different features and then automatically optimize its own ranking algorithms for your domain?
In this presentation, you’ll learn how to do just that: how to evolve Lucene/Solr implementations into self-learning data systems that accept user queries, deliver relevance-ranked results, and automatically learn from your users’ subsequent interactions to continually deliver a more relevant experience for each keyword, category, and group of users.
Such a self-learning system leverages reflected intelligence to consistently improve its understanding of the content (documents and queries), the context of specific users, and the relevance signals present in the collective feedback from every prior user interaction with the system. Come learn how to move beyond manual relevancy tuning and toward a closed-loop system leveraging both the embedded meaning within your content and the wisdom of the crowds to automatically generate search relevancy algorithms optimized for your domain.
Reflected Intelligence in Lucene/Solr
1. Reflected Intelligence: Lucene/Solr as a self-learning data system
Trey Grainger
SVP of Engineering, Lucidworks
OCTOBER 11-14, 2016 • BOSTON, MA
2. Trey Grainger
SVP of Engineering
• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search - Stanford University
Other fun projects:
• Co-author of Solr in Action, plus a handful of research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor
About Me
6. The Three C’s
Content:
Keywords and other features in your documents
Collaboration:
How others have chosen to interact with your system
Context:
Available information about your users and their intent
Reflected Intelligence
“Leveraging previous data and interactions to improve how
new data and interactions should be interpreted”
8. ● Recommendation Engines
● Building user profiles from past searches, clicks, and other actions
● Identifying correlations between keywords/phrases
● Building out automatically-generated ontologies from content and queries
● Determining relevancy judgements (precision, recall, nDCG, etc.) from click
logs
● Learning to Rank - using relevancy judgements and machine learning to train
a relevance model
● Discovering misspellings, synonyms, acronyms, and related keywords
● Disambiguation of keyword phrases with multiple meanings
● Learning what’s important in your content
Examples of Reflected Intelligence
11. The inverted index

What you SEND to Lucene/Solr:

Document   Content Field
doc1       once upon a time, in a land far, far away
doc2       the cow jumped over the moon.
doc3       the quick brown fox jumped over the lazy dog.
doc4       the cat in the hat
doc5       The brown cow said “moo” once.
…          …

How the content is INDEXED into Lucene/Solr (conceptually):

Term       Documents
a          doc1 [2x]
brown      doc3 [1x], doc5 [1x]
cat        doc4 [1x]
cow        doc2 [1x], doc5 [1x]
…          …
once       doc1 [1x], doc5 [1x]
over       doc2 [1x], doc3 [1x]
the        doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…          …
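To make the mapping above concrete, here is a minimal Python sketch (not Lucene's actual implementation) that builds the same kind of term-to-document mapping from the five example documents, ignoring real analysis steps such as tokenization rules, stemming, and stop words.

from collections import defaultdict

# Toy model of index time: for each term, which documents it appears in and how often.
docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": 'The brown cow said "moo" once.',
}

index = defaultdict(lambda: defaultdict(int))   # term -> {doc_id: term frequency}
for doc_id, text in docs.items():
    for token in text.lower().replace(",", " ").replace(".", " ").replace('"', " ").split():
        index[token][doc_id] += 1

print(dict(index["the"]))    # {'doc2': 2, 'doc3': 2, 'doc4': 2, 'doc5': 1}
print(dict(index["brown"]))  # {'doc3': 1, 'doc5': 1}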
13. Classic Lucene Relevancy Algorithm (now switched to BM25):
*Source: Solr in Action, chapter 3
Score(q, d) =
  ∑ over t in q of ( tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) ) · coord(q, d) · queryNorm(q)

Where:
  t = term; d = document; q = query; f = field
  tf(t in d) = numTermOccurrencesInDocument ^ ½
  idf(t) = 1 + log( numDocs / (docFreq + 1) )
  coord(q, d) = numTermsInDocumentFromQuery / numTermsInQuery
  queryNorm(q) = 1 / (sumOfSquaredWeights ^ ½)
  sumOfSquaredWeights = q.getBoost()² · ∑ over t in q of ( idf(t) · t.getBoost() )²
  norm(t, d) = d.getBoost() · lengthNorm(f) · f.getBoost()
14. • Term Frequency: “How well does a term describe the document?”
– Measure: how often the term occurs in the document
• Inverse Document Frequency: “How important is the term overall?”
– Measure: how rare the term is across all documents
TF * IDF
*Source: Solr in Action, chapter 3
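As a toy illustration of how those two factors combine in the classic formula above (ignoring boosts, coord, and norms), with made-up corpus sizes:

import math

def tf(term_occurrences_in_doc):
    # square-root dampening of raw term counts, per the formula above
    return math.sqrt(term_occurrences_in_doc)

def idf(num_docs, doc_freq):
    return 1 + math.log(num_docs / (doc_freq + 1))

# e.g. a term appears 3x in a document and in 500 of 1,000,000 documents overall
score_contribution = tf(3) * idf(1_000_000, 500) ** 2
print(round(score_contribution, 2))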
15. News Search : popularity and freshness drive relevance
Restaurant Search: geographical proximity and price range are critical
Ecommerce: likelihood of a purchase is key
Movie search: More popular titles are generally more relevant
Job search: category of job, salary range, and geographical proximity matter
TF * IDF of keywords can’t hold its own against good domain-specific relevance factors!
That’s great, but what about domain-specific knowledge?
16. John lives in Boston but wants to move to New York or possibly another big city. He is
currently a sales manager but wants to move towards business development.
Irene is a bartender in Dublin and is only interested in jobs within 10KM of her location
in the food service industry.
Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a
Big Data company. He is happy to move across the U.S. for the right job.
Jane is a nurse educator in Boston seeking a salary between $40K and $60K.
*Example from chapter 16 of Solr in Action
Consider what you know about users
21. Taxonomies / Entity Extraction
Identifying the important entities within your domain
22.
23. Building a Taxonomy of Entities
Many ways to generate this:
• Topic Modelling
• Clustering of documents
• Statistical Analysis of interesting phrases
-Word2Vec / Dice Conceptual Search
• Buy a dictionary (often doesn’t work for
domain-specific search problems)
• Generate a model of domain-specific phrases by
mining query logs for commonly searched phrases within the domain*
* K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
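As a rough sketch of the Word2Vec option listed above, here is how a related-terms model could be trained with gensim; the tiny corpus (and the idea of feeding it tokenized job postings or query logs) is an illustrative assumption, not the exact pipeline from the talk.

from gensim.models import Word2Vec

sentences = [
    ["senior", "java", "developer", "with", "hadoop", "and", "hive", "experience"],
    ["registered", "nurse", "needed", "for", "oncology", "department"],
    # ... many more tokenized documents and/or queries from your domain
]

# Train term vectors, then ask for the nearest neighbors of a domain term.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
print(model.wv.most_similar("java", topn=5))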
24. Differentiating related terms
Synonyms: cpa => certified public accountant
rn => registered nurse
r.n. => registered nurse
Ambiguous Terms: driver => driver (trucking) ~80% likelihood
driver => driver (software) ~20% likelihood
Related Terms: r.n. => nursing, bsn
hadoop => mapreduce, hive, pig
Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.
31. Probabilistic Query Parser
Goal: given a query, predict which
combinations of keywords should be
combined together as phrases
Example:
senior java developer hadoop
Possible Parsings:
senior, java, developer, hadoop
"senior java", developer, hadoop
"senior java developer", hadoop
"senior java developer hadoop”
"senior java", "developer hadoop”
senior, "java developer", hadoop
senior, java, "developer hadoop" Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization,
and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.
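A minimal Python sketch of this idea, assuming phrase likelihoods have already been mined from query logs (the dictionary and probabilities below are made up): it enumerates every contiguous segmentation of the query and keeps the most probable one.

# Hypothetical phrase statistics mined from query logs (illustrative numbers only)
phrase_prob = {
    "senior java developer": 0.40,
    "java developer": 0.25,
    "senior java": 0.02,
    "hadoop": 0.15,
    "senior": 0.10,
    "java": 0.12,
    "developer": 0.08,
}

def segmentations(tokens):
    """Yield every way to split the token list into contiguous phrases."""
    n = len(tokens)
    for cuts in range(1 << (n - 1)):          # each bit i = "break after token i?"
        parse, start = [], 0
        for i in range(n - 1):
            if cuts & (1 << i):
                parse.append(" ".join(tokens[start:i + 1]))
                start = i + 1
        parse.append(" ".join(tokens[start:]))
        yield parse

def score(parse, floor=1e-4):
    """Score a parse as the product of phrase likelihoods (unknown phrases get a small floor)."""
    p = 1.0
    for phrase in parse:
        p *= phrase_prob.get(phrase, floor)
    return p

query = "senior java developer hadoop".split()
print(max(segmentations(query), key=score))   # ['senior java developer', 'hadoop']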
32. Input: senior hadoop developer java ruby on rails perl
Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.
33. Semantic Search Architecture – Query Parsing
Identification of phrases in queries using two steps:
1) Check a dictionary of known terms that is continuously
built, cleaned, and refined based upon common inputs from
interactions with real users of the system. The SolrTextTagger
works well for this.*
2) Also invoke a statistical phrase identifier to dynamically
identify unknown phrases using statistics from a corpus of data
(language model)
*K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation
through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
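Here is a minimal stand-in for step 1, a greedy longest-match tagger over a dictionary of known phrases. It is not the SolrTextTagger itself (which does this server-side against an indexed dictionary), just an illustration of the behavior; the dictionary contents are assumptions.

# Dictionary of known phrases, continuously refined from user interactions (illustrative)
KNOWN_PHRASES = {"senior java developer", "java developer", "registered nurse", "big data"}
MAX_PHRASE_LEN = 4  # longest phrase (in tokens) present in the dictionary

def tag_phrases(query):
    tokens = query.lower().split()
    tagged, i = [], 0
    while i < len(tokens):
        # Greedily try the longest candidate phrase starting at position i
        for length in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + length])
            if length == 1 or candidate in KNOWN_PHRASES:
                tagged.append(candidate)
                i += length
                break
    return tagged

print(tag_phrases("Senior Java Developer with big data experience"))
# ['senior java developer', 'with', 'big data', 'experience']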
36. Documents:

id: 1 | job_title: Software Engineer | desc: software engineer at a great company | skills: .Net, C#, java
id: 2 | job_title: Registered Nurse | desc: a registered nurse at hospital doing hard work | skills: oncology, phlebotomy
id: 3 | job_title: Java Developer | desc: a software engineer or a java engineer doing work | skills: java, scala, hibernate

Terms-Docs Inverted Index (field / term -> doc: positions):

desc / a          -> doc 1: 4;  doc 2: 1;  doc 3: 1, 5
desc / at         -> doc 1: 3;  doc 2: 4
desc / company    -> doc 1: 6
desc / doing      -> doc 2: 6;  doc 3: 8
desc / engineer   -> doc 1: 2;  doc 3: 3, 7
desc / great      -> doc 1: 5
desc / hard       -> doc 2: 7
desc / hospital   -> doc 2: 5
desc / java       -> doc 3: 6
desc / nurse      -> doc 2: 3
desc / or         -> doc 3: 4
desc / registered -> doc 2: 2
desc / software   -> doc 1: 1;  doc 3: 2
desc / work       -> doc 2: 10; doc 3: 9
job_title / java developer -> doc 3: 1
…

Docs-Terms Uninverted Index (field / doc -> terms):

desc / doc 1      -> a, at, company, engineer, great, software
desc / doc 2      -> a, at, doing, hard, hospital, nurse, registered, work
desc / doc 3      -> a, doing, engineer, java, or, software, work
job_title / doc 1 -> Software Engineer
…
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain.” DSAA 2016.
37. Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain.” DSAA 2016.
How the Graph Traversal Works

[Figure: the same skill-to-document relationships shown three ways: a Set-theory View, a Graph View, and a Data Structure View. For example, skill "Java" maps to docs 1, 2, and 6; skills "Scala" and "Hibernate" map to docs 3 and 4; skill "Oncology" maps to doc 5.]
38. Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain.” DSAA 2016.
Multi-level Traversal
[Figure: a two-level traversal shown in Data Structure and Graph views. Starting from skill "Java", an inverted index lookup finds the matching documents, a doc values (uninverted) index lookup collects the other skills in those documents (Java, Scala, Hibernate, Oncology), and a second pair of lookups follows has_related_job_title to job titles such as Software Engineer, Data Scientist, and Java Developer.]
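A minimal Python sketch of that two-step traversal, using the toy documents and indexes from slide 36 (term to documents via the inverted index, documents to values in another field via the uninverted/doc values index):

inverted_index = {   # field -> term -> set of doc ids
    "skills": {"java": {1, 3}, "scala": {3}, "hibernate": {3}, "oncology": {2}},
}
uninverted_index = {  # field -> doc id -> stored values (doc values)
    "skills": {1: {".net", "c#", "java"}, 2: {"oncology", "phlebotomy"}, 3: {"java", "scala", "hibernate"}},
    "job_title": {1: {"software engineer"}, 2: {"registered nurse"}, 3: {"java developer"}},
}

def traverse(from_field, term, to_field):
    docs = inverted_index[from_field].get(term, set())      # inverted index lookup
    values = set()
    for doc in docs:                                         # doc values index lookups
        values |= uninverted_index[to_field].get(doc, set())
    return docs, values

docs, related_job_titles = traverse("skills", "java", "job_title")
print(docs, related_job_titles)   # {1, 3} {'software engineer', 'java developer'}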
39. Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
Scoring nodes in the Graph
Foreground vs. Background Analysis
Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.
countFG(x) - totalDocsFG * probBG(x)
z = --------------------------------------------------------
sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))
{ "type":"keywords”, "values":[
{ "value":"hive", "relatedness": 0.9765, "popularity":369 },
{ "value":"spark", "relatedness": 0.9634, "popularity":15653 },
{ "value":".net", "relatedness": 0.5417, "popularity":17683 },
{ "value":"bogus_word", "relatedness": 0.0, "popularity":0 },
{ "value":"teaching", "relatedness": -0.1510, "popularity":9923 },
{ "value":"CPR", "relatedness": -0.4012, "popularity":27089 } ] }
+
-
Foreground Query:
"Hadoop"
40. Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain.” DSAA 2016.
Multi-level Graph Traversal with Scores
[Figure: starting from the materialized node "software engineer*", the traversal follows has_related_skill edges to skill nodes (Java, C#, .NET, VB.NET, Scala, Hibernate), then has_related_job_title edges to job title nodes (.NET Developer, Java Developer, Software Engineer, Data Scientist), with a relatedness score between 0.34 and 0.93 attached to each traversed edge.]
45. How do we handle phrases with ambiguous meanings?
Example    Related Keywords (representing multiple meanings)
driver     truck driver, linux, windows, courier, embedded, cdl, delivery
architect  autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer
…          …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
46. Query Log Mining: Discovering ambiguous phrases
1) Classify users who ran each search in the search logs (i.e. by the job title classifications of the jobs to which they applied)
2) Create a probabilistic graphical model of those classifications mapped to each keyword phrase.
3) Segment the search term => related search terms list by classification, to return a separate related terms list per classification (see the sketch below).
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
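A minimal sketch of the outcome of those steps: related terms observed in the logs, keyed by the searching user's classification, yield one related-terms list per sense. The log entries below are made up.

from collections import defaultdict

search_log = [
    # (user classification, query term, co-occurring/related search term)
    ("trucking", "driver", "cdl"),
    ("trucking", "driver", "truck driver"),
    ("software", "driver", "linux"),
    ("software", "driver", "embedded"),
    ("trucking", "driver", "delivery"),
]

related_by_sense = defaultdict(lambda: defaultdict(set))  # query -> classification -> terms
for classification, query, related in search_log:
    related_by_sense[query][classification].add(related)

for sense, terms in related_by_sense["driver"].items():
    print(sense, "=>", sorted(terms))
# trucking => ['cdl', 'delivery', 'truck driver']
# software => ['embedded', 'linux']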
47. Semantic Knowledge Graph: Discovering ambiguous phrases
1) The exact same concept, but use a document classification field (e.g. category) as the first level of your graph, and the related terms as the second level to which you traverse.
2) This has the benefit that you don’t need query logs to mine, but it will be representative of your data (as opposed to your users’ intent), so the quality depends on how clean and representative your documents are.
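For reference, newer Solr versions (7.4+) expose this Semantic Knowledge Graph scoring directly as the relatedness() aggregation in the JSON Facet API, which makes the category-first traversal a single nested facet request. The field names, collection URL, and foreground query below are assumptions for illustration, not configuration from the talk.

import requests

facet_request = {
    "query": "*:*",
    "limit": 0,
    "params": {"fore": "keywords:driver", "back": "*:*"},
    "facet": {
        "categories": {                       # level 1: document classification field
            "type": "terms", "field": "category", "limit": 5,
            "sort": {"r": "desc"},
            "facet": {
                "r": "relatedness($fore,$back)",
                "related_terms": {            # level 2: related terms within each category
                    "type": "terms", "field": "keywords", "limit": 10,
                    "sort": {"r2": "desc"},
                    "facet": {"r2": "relatedness($fore,$back)"},
                },
            },
        }
    },
}

response = requests.post("http://localhost:8983/solr/jobs/select", json=facet_request)
print(response.json()["facets"])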
48. Disambiguated meanings (represented as term vectors)
Example Related Keywords (Disambiguated Meanings)
architect 1: enterprise architect, java architect, data architect, oracle, java, .net
2: architectural designer, architectural drafter, autocad, autocad drafter, designer,
drafter, cad, engineer
driver 1: linux, windows, embedded
2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
designer 1: design, print, animation, artist, illustrator, creative, graphic artist, graphic,
photoshop, video
2: graphic, web designer, design, web design, graphic design, graphic designer
3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe,
structural designer, revit
… …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
49. Using the disambiguated meanings
In a situation where a user searches for an ambiguous phrase, what information can we
use to pick the correct underlying meaning?
1. Any pre-existing knowledge about the user:
• User is a software engineer
• User has previously run searches for “c++” and “linux”
2. Context within the query:
User searched for windows AND driver vs. courier OR driver
3. If all else fails (and there is no context), use the most commonly occurring meaning.
driver 1: linux, windows, embedded
2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
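A minimal sketch of that selection logic, using simple term overlap between the query/user context and each sense's related-terms vector (the sense data comes from the disambiguation step above; the overlap scoring is an illustrative simplification):

senses = {
    "driver": {   # senses listed most-common first (per the observed ~80/20 split)
        "trucking": {"truck driver", "cdl driver", "delivery driver", "cdl", "courier"},
        "software": {"linux", "windows", "embedded"},
    }
}

def pick_sense(term, query_terms, user_profile_terms=()):
    context = set(query_terms) | set(user_profile_terms)
    context.discard(term)
    best_sense, best_overlap = None, -1
    for sense, terms in senses[term].items():
        overlap = len(terms & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    # With no contextual overlap, this falls back to the first (most common) sense.
    return best_sense

print(pick_sense("driver", ["windows", "driver"]))   # software
print(pick_sense("driver", ["courier", "driver"]))   # trucking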
52. How to Measure Relevancy?

A = the set of retrieved documents
C = the set of relevant documents
B = A ∩ C (the relevant documents that were actually retrieved)

Precision = |B| / |A|
Recall = |B| / |C|

Problem: suppose Precision = 90% and Recall = 100%, but the 10% of retrieved documents that are irrelevant were all ranked at the top of the results. Is that OK?
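A tiny worked example of those set-based measures (note that neither one looks at where the relevant documents appear in the ranking, which is exactly the problem posed above):

def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant)

retrieved = {"doc1", "doc2", "doc3", "doc4", "doc5"}
relevant = {"doc2", "doc3", "doc4", "doc5", "doc6"}
print(precision(retrieved, relevant), recall(retrieved, relevant))  # 0.8 0.8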
53. Normalized Discounted Cumulative Gain (nDCG)

Ranking: Given          Ranking: Ideal
Rank  Relevancy         Rank  Relevancy
3     0.95              1     0.95
1     0.70              2     0.85
2     0.60              3     0.80
4     0.45              4     0.65

• Position is considered in quantifying relevancy.
• A labeled dataset is required.
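A minimal nDCG computation for a ranking like the "Given" example above: the graded relevance values are taken in the order the system returned them and compared against the best possible ordering of the same documents.

import math

def dcg(relevances):
    # relevance discounted by the log of its rank position
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

given = [0.70, 0.60, 0.95, 0.45]      # relevance of the docs at ranks 1-4 in the given ranking
ideal = sorted(given, reverse=True)   # best possible ordering of the same docs
print(round(dcg(given) / dcg(ideal), 4))   # nDCG@4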
55. Learning to Rank (LTR)
● It applies machine learning techniques to discover the combination of features that produces the best ranking.
● It requires a labeled set of documents with relevancy scores for a given set of queries.
● The features used for ranking are usually more computationally expensive than the ones used for matching.
● It therefore typically re-ranks only a subset of the matched documents (e.g. the top 1,000); see the sketch below.
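A hedged sketch of that re-ranking step: a previously trained model (LambdaMART in the talk; a stand-in linear scorer here) is applied only to the top N matched documents, since the ranking features are too expensive to compute for the full match set. The feature names and weights are illustrative assumptions.

def extract_features(query, doc):
    # ranking features computed only for the re-ranked head of the result set
    return [doc["bm25_score"], doc["clicks"], doc["freshness"]]

def ltr_score(features, weights=(0.5, 0.3, 0.2)):
    # stand-in for a trained LTR model's prediction
    return sum(w * f for w, f in zip(weights, features))

def rerank(query, matched_docs, top_n=1000):
    head, tail = matched_docs[:top_n], matched_docs[top_n:]
    head = sorted(head, key=lambda d: ltr_score(extract_features(query, d)), reverse=True)
    return head + tail

docs = [
    {"id": 1, "bm25_score": 7.2, "clicks": 0.1, "freshness": 0.9},
    {"id": 2, "bm25_score": 6.8, "clicks": 0.8, "freshness": 0.4},
]
print([d["id"] for d in rerank("java developer", docs)])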
58. LambdaMART Example
Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016
59. Obtaining Relevancy Judgements
• Typical Methodologies
1) Hire employees, contractors, or interns
-Pros:
Accuracy
-Cons:
Expensive
Not scalable (cost or man-power-wise)
Data Becomes Stale
• 2) Crowdsource
-Pros:
Less cost, more scalable
-Cons:
Less accurate
Data still becomes stale
Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016
60. Reflected Intelligence: Possible to infer relevancy judgements?
Rank   Document ID
1      Doc1
2      Doc2
3      Doc3
4      Doc4

[Figure: click patterns observed for the same query across different sessions, e.g. one session clicks Doc2 and Doc3 but not Doc1, while another clicks only Doc1; aggregated across many users, these clicks can be treated as implicit relevance judgements.]
Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016
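As a minimal sketch of that inference, per-(query, document) click-through rates computed from logged impressions and clicks can serve as implicit judgements; real systems also correct for position bias with click models, and the log entries below are made up.

from collections import defaultdict

click_log = [
    # (query, doc_id, clicked)
    ("java developer", "doc1", 0), ("java developer", "doc2", 1), ("java developer", "doc3", 1),
    ("java developer", "doc1", 1), ("java developer", "doc2", 0), ("java developer", "doc3", 0),
    ("java developer", "doc1", 0), ("java developer", "doc2", 1), ("java developer", "doc3", 0),
]

impressions = defaultdict(int)
clicks = defaultdict(int)
for query, doc_id, clicked in click_log:
    impressions[(query, doc_id)] += 1
    clicks[(query, doc_id)] += clicked

# Click-through rate per (query, doc) pair as a crude graded relevance judgement
judgements = {key: clicks[key] / impressions[key] for key in impressions}
print(judgements)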