Building Search & Recommendation Engines
Trey Grainger
SVP of Engineering, Lucidworks
Greenville Data Science
2017.06.29
Trey Grainger
SVP of Engineering
• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search - Stanford University
Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor
• Startup Investor / Advisor
About Me
what do you do?
Search-Driven Everything
• Customer Service
• Customer Insights
• Fraud Surveillance
• Research Portal
• Online Retail
• Digital Content
Apache Solr
“Solr is the popular, blazing-fast,
open source enterprise search
platform built on Apache Lucene™.”
Key Solr Features:
● Multilingual Keyword search
● Relevancy Ranking of results
● Faceting & Analytics (nested / relational)
● Highlighting
● Spelling Correction
● Autocomplete/Type-ahead Prediction
● Sorting, Grouping, Deduplication
● Distributed, Fault-tolerant, Scalable
● Geospatial search
● Complex Function queries
● Recommendations (More Like This)
● Graph Queries and Traversals
● SQL Query Support
● Streaming Aggregations
● Batch and Streaming processing
● Highly Configurable / Plugins
● Learning to Rank
● Building machine-learning models
● … many more
*source: Solr in Action, chapter 2
The standard for enterprise search.
90% of the Fortune 500 uses Solr.
Lucidworks Fusion
All Your Data
• Over 50 connectors to
integrate all your data
• Robust parsing framework
to seamlessly ingest all your
document types
• Point and click Indexing
configuration and iterative
simulation of results for full
control over your ETL
process
• Your security model
enforced end-to-end from
ingest to search across your
different datasources
Experience
Management
• Relevancy tuning: Point-and-click
query pipeline configuration allows
fine-grained control of results.
• Machine-driven relevancy:
Signals aggregation learns and
automatically tunes relevancy and
drives recommendations out of the
box.
• Powerful pipeline stages:
Customize fields, stages,
synonyms, boosts, facets,
machine learning models, your
own scripted behavior, and
dozens of other powerful search
stages.
• Turnkey search UI
(Lucidworks View): Build a
sophisticated end-to-end search
application in just hours.
• Seamless integration of your
entire search & analytics
platform
• All capabilities exposed
through secured APIs, so
you can use our UI or build
your own.
• End-to-end security policies
can be applied out of the
box to every aspect of your
search ecosystem.
• Distributed, fault-tolerant
scaling and supervision of
your entire search
application
• Modular library of UI components to create
prototypes in hours, not weeks.
• Fine-grained security for government and
Fortune 500, military, law enforcement,
enforcing permissions by item, role,
geography or other parameters.
• Stateless architecture so apps are robust,
easy to deploy, and highly scalable.
• Supports over 25 data platforms including
Solr, SharePoint, Elasticsearch, Cloudera,
Attivio, FAST, MongoDB, and many more -
and of course Lucidworks Fusion.
• Full library of visualization components for
charts, pivots, graphs and more.
• Pre-tested re-usable modules include
pagination, faceting, geospatial mapping, rich
snippets, heatmaps, topic pages, and more.
Create custom search and
discovery applications in minutes.
SECURITY BUILT-IN
Lucidworks Fusion Architecture
[Figure: architecture diagram. Components shown: Apache Solr (shards); Apache ZooKeeper (leader election, load balancing, shared config management); Apache Spark (cluster manager, workers); Core Services (NLP, Recommenders / Signals, Blob Storage, Pipelines, Scheduling, Alerting / Messaging, Connectors); REST API; Admin UI; Twigkit. Data sources: logs, file, web, database, cloud. HDFS is optional.]
Fusion powers search for the brightest companies in the
world.
Lucidworks Fusion
search & relevancy
Basic Keyword Search
The beginning of a typical search journey
The inverted index

What you SEND to Lucene/Solr:

Document   Content Field
doc1       once upon a time, in a land far, far away
doc2       the cow jumped over the moon.
doc3       the quick brown fox jumped over the lazy dog.
doc4       the cat in the hat
doc5       The brown cow said “moo” once.
…          …

How the content is INDEXED into Lucene/Solr (conceptually):

Term    Documents
a       doc1 [2x]
brown   doc3 [1x], doc5 [1x]
cat     doc4 [1x]
cow     doc2 [1x], doc5 [1x]
…       …
once    doc1 [1x], doc5 [1x]
over    doc2 [1x], doc3 [1x]
the     doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…       …
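The mapping from documents to the inverted index can be sketched in a few lines of Python. This is only an illustration of the concept, not how Lucene stores its index: the regex tokenizer, the lowercasing, and the name `build_inverted_index` are assumptions made for the example.

```python
import re
from collections import defaultdict

# The five example documents from the slide.
docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": 'The brown cow said "moo" once.',
}


def build_inverted_index(documents):
    """Map each lowercased term to {doc_id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Assumed tokenization: runs of word characters, lowercased.
        for term in re.findall(r"\w+", text.lower()):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


index = build_inverted_index(docs)
print(index["the"])    # postings for "the", with per-document counts
print(index["brown"])  # postings for "brown"
```

Looking up a term is then a single dictionary access, which is why keyword lookups stay fast as the collection grows.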
Greenville Data Science & Analytics
Matching queries to documents
/solr/select/?q=apache solr

Term     Documents
…        …
apache   doc1, doc3, doc4, doc5
…        …
hadoop   doc2, doc4, doc6
…        …
solr     doc1, doc3, doc4, doc7, doc8
…        …

[Figure: Venn diagram of the result sets. "apache" only: doc5. "solr" only: doc7, doc8. Both "apache" and "solr": doc1, doc3, doc4.]
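The Venn diagram above amounts to set operations over postings lists. A minimal sketch, assuming each term's postings are held as a Python set (the `match` helper and its `operator` argument are hypothetical names, not Solr API):

```python
# Postings lists copied from the slide.
postings = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}


def match(terms, operator="OR"):
    """Return the ids of documents matching all (AND) or any (OR) of the terms."""
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return set()
    if operator == "AND":
        return set.intersection(*sets)
    return set.union(*sets)


print(sorted(match(["apache", "solr"], "AND")))  # documents containing both terms
print(sorted(match(["apache", "solr"], "OR")))   # documents containing either term
```

Ranking then decides the order within the matched set; documents in the intersection would typically score higher than those matching a single term.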
Text Analysis
Generating terms to index from raw text
Text Analysis in Solr
A text field in Lucene/Solr has an Analyzer containing:
① Zero or more CharFilters
Takes incoming text and “cleans it up”
before it is tokenized
② One Tokenizer
Splits incoming text into a Token Stream
containing Zero or more Tokens
③ Zero or more TokenFilters
Examines and optionally modifies each
Token in the Token Stream
*From Solr in Action, Chapter 6
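The three analyzer stages can be sketched as plain functions chained together. This is an illustrative model only, not Lucene's API: the function names, the tiny stop-word set, and the regex-based HTML stripping are all assumptions for the example.

```python
import html
import re

# Hypothetical stop-word list for the example.
STOPWORDS = {"a", "the", "in", "of"}


def strip_html(text):
    """CharFilter stage: clean up the raw text before tokenization."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def whitespace_tokenizer(text):
    """Tokenizer stage: split the text into a stream of tokens."""
    return text.split()


def lowercase(tokens):
    """TokenFilter stage: modify each token."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens):
    """TokenFilter stage: drop tokens (a filter may emit fewer tokens)."""
    return [token for token in tokens if token not in STOPWORDS]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run zero or more char filters, one tokenizer, zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<b>The</b> Cat in the Hat",
    char_filters=[strip_html],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase, remove_stopwords],
)
print(tokens)
```

The order of the token filters matters: here lowercasing runs first so that "The" matches the lowercase stop-word list.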
Multi-lingual Text Analysis
Analyzing text across multiple languages
Example English Analysis Chains
<!-- Chain 1: standard tokenizer, stop words, lowercasing,
     possessive removal, protected words, Porter stemming -->
<fieldType name="text_en" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            words="lang/stopwords_en.txt"
            ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="lang/en_protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
<!-- Chain 2 (an alternative definition of text_en; use one or the other):
     HTML stripping, whitespace tokenizer, synonyms, word delimiting,
     ASCII folding, KStem stemming, duplicate removal -->
<fieldType name="text_en" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="lang/en_synonyms.txt"
            ignoreCase="true"
            expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Per-language Analysis Chains
*Some of the 32 different languages configurations in Appendix B of Solr in Action
Which Stemmer do I choose?
*From Solr in Action, Chapter 14
Common English Stemmers
Relevancy Ranking
Scoring the results, returning the best matches
Classic Lucene/Solr Relevancy Algorithm:
*Source: Solr in Action, chapter 3
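\[
\mathrm{Score}(q, d) = \Big( \sum_{t \in q} \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t, d) \Big) \cdot \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q)
\]
Where:
t = term; d = document; q = query; f = field
\[
\begin{aligned}
\mathrm{tf}(t \in d) &= \mathrm{numTermOccurrencesInDocument}^{1/2} \\
\mathrm{idf}(t) &= 1 + \log\big(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\big) \\
\mathrm{coord}(q, d) &= \mathrm{numTermsInDocumentFromQuery} / \mathrm{numTermsInQuery} \\
\mathrm{queryNorm}(q) &= 1 / \mathrm{sumOfSquaredWeights}^{1/2} \\
\mathrm{sumOfSquaredWeights} &= q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \big(\mathrm{idf}(t) \cdot t.\mathrm{getBoost}()\big)^2 \\
\mathrm{norm}(t, d) &= d.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(f) \cdot f.\mathrm{getBoost}()
\end{aligned}
\]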
• Term Frequency: “How well a term describes a document?”
– Measure: how often a term occurs per document
• Inverse Document Frequency: “How important is a term overall?”
– Measure: how rare the term is across all documents
TF * IDF
*Source: Solr in Action, chapter 3
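A small sketch of the TF * IDF product, using the tf and idf definitions given in the classic relevancy formula in this deck (square-root term frequency, and idf = 1 + log(numDocs / (docFreq + 1))). The natural logarithm and the five-document corpus size are assumptions for the example.

```python
import math


def tf(term_count):
    """Term frequency: square root of the occurrences in the document."""
    return math.sqrt(term_count)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms get a larger weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def tf_idf(term_count, num_docs, doc_freq):
    return tf(term_count) * idf(num_docs, doc_freq)


# Assume a corpus of exactly the five example documents shown earlier.
# In doc4 ("the cat in the hat"): "cat" occurs once and appears in 1 document,
# "the" occurs twice but appears in 4 of the 5 documents.
cat_weight = tf_idf(term_count=1, num_docs=5, doc_freq=1)
the_weight = tf_idf(term_count=2, num_docs=5, doc_freq=4)
print(cat_weight > the_weight)  # the rare term outweighs the frequent one
```

Even though "the" occurs more often in the document, its low idf keeps it from dominating the score.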
BM25 (Okapi “Best Match” 25th Iteration)
\[
\mathrm{Score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t \in d) \cdot (k + 1)}{\mathrm{tf}(t \in d) + k \cdot \big(1 - b + b \cdot |d| / \mathrm{avgdl}\big)}
\]
Where:
t = term; d = document; q = query; i = index
\[
\begin{aligned}
\mathrm{tf}(t \in d) &= \mathrm{numTermOccurrencesInDocument}^{1/2} \\
\mathrm{idf}(t) &= 1 + \log\big(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\big) \\
|d| &= \sum_{t \in d} 1 \\
\mathrm{avgdl} &= \Big( \sum_{d \in i} |d| \Big) \Big/ \Big( \sum_{d \in i} 1 \Big)
\end{aligned}
\]
k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.
b = Free parameter. Usually ~0.75. Increases impact of document normalization.
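The BM25 formula as written on this slide can be sketched as follows. The sketch follows the slide's own tf and idf definitions; production implementations may define those two terms differently. The function names and argument layout are assumptions for the example.

```python
import math


def bm25_term(term_count, doc_len, avg_doc_len, num_docs, doc_freq, k=1.2, b=0.75):
    """Score contribution of one query term for one document."""
    tf = math.sqrt(term_count)                        # slide's tf definition
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))   # slide's idf definition
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return idf * (tf * (k + 1)) / (tf + k * length_norm)


def bm25_score(query_terms, doc_term_counts, avg_doc_len, num_docs, doc_freqs,
               k=1.2, b=0.75):
    """Sum the per-term contributions over the query terms present in the document."""
    doc_len = sum(doc_term_counts.values())  # |d| = number of tokens in d
    return sum(
        bm25_term(doc_term_counts[t], doc_len, avg_doc_len, num_docs, doc_freqs[t], k, b)
        for t in query_terms
        if t in doc_term_counts
    )


# Average-length document versus one twice as long, same term statistics.
print(bm25_term(1, 10, 10, 5, 4))
print(bm25_term(1, 20, 10, 5, 4))
```

Two behaviours are worth noting: the contribution of a term saturates below idf · (k + 1) no matter how often it occurs, and documents longer than average are penalised in proportion to b.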
News Search : popularity and freshness drive relevance
Restaurant Search: geographical proximity and price range are critical
Ecommerce: likelihood of a purchase is key
Movie search: More popular titles are generally more relevant
Job search: category of job, salary range, and geographical proximity matter
TF * IDF of keywords can’t hold its own against good
domain-specific relevance factors!
That’s great, but what about domain-specific knowledge?
*Example from chapter 16 of Solr in Action
Domain-specific relevancy calculation (News Website Example)
News website:
/select?
fq=$myQuery&
q=_query_:"{!func}scale(query($myQuery),0,100)"
AND _query_:"{!func}div(100,map(geodist(),0,1,1))"
AND _query_:"{!func}recip(rord(publicationDate),0,100,100)"
AND _query_:"{!func}scale(popularity,0,100)"&
myQuery="street festival"&
sfield=location&
pt=33.748,-84.391
(Each of the four function-query clauses in q contributes 25% of the score.)
Fancy boosting functions (Restaurant Search Example)
Distance (50%) + keywords (30%) + category (20%)
q=_val_:"scale(mul(query($keywords),1),0,30)" AND
_val_:"scale(sum($radiusInKm,mul(query($distance),-1)),0,50)" AND
_val_:"scale(mul(query($category),1),0,20)"
&keywords=filet mignon
&radiusInKm=48.28
&distance=_val_:"geodist(latitudelongitude.latlon_is,33.77402,-84.29659)"
&category="fine dining"
&fq={!cache=false v=$keywords}
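The idea behind the weighted query above can be approximated outside Solr: rescale each signal to a range whose upper bound is its weight, then add the pieces. This is a rough sketch only. The `scale` helper below is a simple min-max rescaling written for the example, and the handling of equal values is an assumption; it does not claim to reproduce Solr's scoring exactly.

```python
def scale(values, lo, hi):
    """Min-max rescale a list of raw values into the range [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # Assumption for this sketch: identical values all map to the lower bound.
        return [float(lo) for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


def combined_scores(keyword_scores, distances_km, category_scores, radius_km):
    """Distance (50%) + keywords (30%) + category (20%), one score per document."""
    keywords = scale(keyword_scores, 0, 30)
    # Closer documents should score higher, so use (radius - distance).
    proximity = scale([radius_km - d for d in distances_km], 0, 50)
    category = scale(category_scores, 0, 20)
    return [k + p + c for k, p, c in zip(keywords, proximity, category)]


# Two hypothetical restaurants: the second is closer and matches better.
scores = combined_scores(
    keyword_scores=[1.0, 3.0],
    distances_km=[40.0, 10.0],
    category_scores=[0.0, 1.0],
    radius_km=48.28,
)
print(scores)
```

Because every component is bounded by its weight, no single signal can contribute more than its share of the 100 available points.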
This is powerful, but feels like
a lot of work to get right…
what is “reflected intelligence”?
The Three C’s
Content:
Keywords and other features in your documents
Collaboration:
How others have chosen to interact with your system
Context:
Available information about your users and their intent
Reflected Intelligence
“Leveraging previous data and interactions to improve how
new data and interactions should be interpreted”
● Recommendation Algorithms
● Building user profiles from past searches, clicks, and other actions
● Identifying correlations between keywords/phrases
● Building out automatically-generated ontologies from content and queries
● Determining relevancy judgements (precision, recall, nDCG, etc.) from click
logs
● Learning to Rank - using relevancy judgements and machine learning to train
a relevance model
● Discovering misspellings, synonyms, acronyms, and related keywords
● Disambiguation of keyword phrases with multiple meanings
● Learning what’s important in your content
Examples of Reflected Intelligence
John lives in Boston but wants to move to New York or possibly another big city. He is
currently a sales manager but wants to move towards business development.
Irene is a bartender in Dublin and is only interested in jobs within 10KM of her location
in the food service industry.
Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a
Big Data company. He is happy to move across the U.S. for the right job.
Jane is a nurse educator in Boston seeking a salary between $40K and $60K.
*Example from chapter 16 of Solr in Action
Consider what you know about users
http://localhost:8983/solr/jobs/select/?
fl=jobtitle,city,state,salary&
q=(
jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10
)
AND (
(city:"Boston" AND state:"MA")^15
OR state:"MA")
AND _val_:"map(salary, 40000, 60000,10, 0)"
*Example from chapter 16 of Solr in Action
Query for Jane
Jane is a nurse educator in Boston seeking a salary between $40K and $60K.
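A query like Jane's can be generated from a stored user profile. The sketch below assembles the same clauses and boosts as the slide; the function name `build_job_query` and the profile arguments are hypothetical, and a real system would also need to escape user-supplied values before placing them in a query.

```python
from urllib.parse import urlencode


def build_job_query(title, city, state, min_salary, max_salary):
    """Build a Solr query string that mirrors the slide's boosts for one user."""
    # Exact phrase match on the title is boosted above a loose term match.
    title_clause = f'jobtitle:"{title}"^25 OR jobtitle:({title})^10'
    # Same city and state is boosted; same state alone still matches.
    location_clause = f'(city:"{city}" AND state:"{state}")^15 OR state:"{state}"'
    # map() yields 10 when salary is inside the desired range, otherwise 0.
    salary_clause = f'_val_:"map(salary,{min_salary},{max_salary},10,0)"'
    return f"({title_clause}) AND ({location_clause}) AND {salary_clause}"


params = {
    "fl": "jobtitle,city,state,salary",
    "q": build_job_query("nurse educator", "Boston", "MA", 40000, 60000),
}
url = "http://localhost:8983/solr/jobs/select/?" + urlencode(params)
print(params["q"])
print(url)
```

Swapping in a different profile (for example Irene's 10 km radius) would mean adding or replacing clauses, while the overall pattern of boosted preferences stays the same.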
Search Results for Jane
{ ...
"response":{"numFound":22,"start":0,"docs":[
{"jobtitle":" Clinical Educator (New England/ Boston)",
"city":"Boston",
"state":"MA",
"salary":41503},
{"jobtitle":"Nurse Educator",
"city":"Braintree",
"state":"MA",
"salary":56183},
{"jobtitle":"Nurse Educator",
"city":"Brighton",
"state":"MA",
"salary":71359},
…]}}
*Example documents available @ http://github.com/treygrainger/solr-in-action
You just built a
recommendation engine!
Collaborative Filtering
What you SEND to Lucene/Solr:

Document   “Users who bought this product” field
doc1       user1, user4, user5
doc2       user2, user3
doc3       user4
doc4       user4, user5
doc5       user4, user1
…          …

How the content is INDEXED into Lucene/Solr (conceptually):

Term    Documents
user1   doc1, doc5
user2   doc2
user3   doc2
user4   doc1, doc3, doc4, doc5
user5   doc1, doc4
…       …
Step 1: Find similar users who like the same documents
q=documentid: ("doc1" OR "doc4")

Document   “Users who bought this product” field
doc1       user1, user4, user5
doc2       user2, user3
doc3       user4
doc4       user4, user5
doc5       user4, user1
…          …

Matched documents: doc1 (user1, user4, user5) and doc4 (user4, user5)

Top-scoring results (most similar users):
1) user4 (2 shared likes)
2) user5 (2 shared likes)
3) user1 (1 shared like)
*Source: Solr in Action, chapter 16
/solr/select/?q=userlikes:("user4"^2
OR "user5"^2 OR "user1"^1)
Step 2: Search for docs “liked” by those similar users
Term    Documents
user1   doc1, doc5
user2   doc2
user3   doc2
user4   doc1, doc3, doc4, doc5
user5   doc1, doc4
…       …

Most similar users:
1) user4 (2 shared likes)
2) user5 (2 shared likes)
3) user1 (1 shared like)

Top recommended documents:
1) doc1 (matches user4, user5, user1)
2) doc4 (matches user4, user5)
3) doc5 (matches user4, user1)
4) doc3 (matches user4)
// doc2 does not match
*Source: Solr in Action, chapter 16
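The two steps can be reproduced in memory to check the rankings on the slides. This is a sketch of the idea rather than of Solr's scoring: each similar user is weighted by the number of shared likes (matching the ^2 / ^1 boosts), and a document's score is simply the sum of the weights of the users who liked it. The function names are hypothetical.

```python
from collections import Counter

# "Users who bought this product" field, copied from the slide.
doc_users = {
    "doc1": {"user1", "user4", "user5"},
    "doc2": {"user2", "user3"},
    "doc3": {"user4"},
    "doc4": {"user4", "user5"},
    "doc5": {"user4", "user1"},
}


def similar_users(liked_docs):
    """Step 1: count how many of the liked documents each user shares."""
    shared = Counter()
    for doc_id in liked_docs:
        shared.update(doc_users.get(doc_id, ()))
    return shared


def recommend(liked_docs):
    """Step 2: score every document by the weights of the users who liked it."""
    weights = similar_users(liked_docs)
    scores = Counter()
    for doc_id, users in doc_users.items():
        score = sum(weights[user] for user in users)  # missing users count as 0
        if score:
            scores[doc_id] = score
    return scores.most_common()


print(similar_users(["doc1", "doc4"]))
print(recommend(["doc1", "doc4"]))
```

As on the slide, the documents the user already liked (doc1, doc4) still rank at the top; a production recommender would usually filter those out before showing results.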
Using matrix factorization is typically more efficient
(Ships with Fusion 3.1):
Feedback Loops
1) User searches
2) User sees results
3) User takes an action
4) Users’ actions inform system improvements
Demo:
Signals & Recommendations
• 200%+ increase in
click-through rates
• 91% lower TCO
• 50,000 fewer support
tickets
• Increased customer
satisfaction
Learning to Rank
Learning to Rank (LTR)
● Applies machine learning techniques to discover the combination of
features that provides the best ranking.
● Requires a labeled set of documents with relevancy scores for a given set
of queries.
● Features used for ranking are usually more computationally expensive
than the ones used for matching.
● Typically re-ranks a subset of the matched documents (e.g. the top 1000).
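The re-ranking step in the last bullet can be sketched as follows: only the top N results from the initial query are re-scored by the model, and the rest keep their original order. The linear model, the feature names, and the hand-set weights below are assumptions for illustration; a trained model would learn its weights from the labeled judgements.

```python
def rerank(results, weights, rerank_docs=1000):
    """Re-order the top `rerank_docs` results by a linear model over their features.

    `results` is a list of (doc_id, features) pairs in original ranking order,
    where `features` maps a feature name to its value for that document.
    """
    head, tail = results[:rerank_docs], results[rerank_docs:]

    def model_score(features):
        # Linear model: weighted sum of the feature values.
        return sum(weights.get(name, 0.0) * value for name, value in features.items())

    head = sorted(head, key=lambda result: model_score(result[1]), reverse=True)
    return head + tail  # documents outside the window keep their position


# Hypothetical features and weights.
results = [
    ("a", {"title_match": 0.0, "popularity": 0.2}),
    ("b", {"title_match": 1.0, "popularity": 0.1}),
    ("c", {"title_match": 1.0, "popularity": 0.9}),
]
weights = {"title_match": 2.0, "popularity": 1.0}
print([doc_id for doc_id, _ in rerank(results, weights, rerank_docs=2)])
```

Limiting the window keeps the expensive feature computation bounded, at the cost of never promoting a document that the initial query ranked outside the window (document "c" here).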
Common LTR Algorithms
• RankNet* (neural networks, boosted trees)
• LambdaMart* (regression trees)
• SVM Rank** (SVM classifier)
** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf
* http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf
LambdaMart Example
Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016
Demo: Learning to Rank
#1: Pull, Build, Start Solr
git clone https://github.com/apache/lucene-solr.git && cd lucene-solr/solr
ant server
bin/solr -e techproducts -Dsolr.ltr.enabled=true
#2: Run Searches
http://localhost:8983/solr/techproducts/browse?q=ipod
#3: Supply User Relevancy Judgements
cd contrib/ltr/example/
nano user_queries.txt
#4: Install Training Library
curl -L https://github.com/cjlin1/liblinear/archive/v210.zip > liblinear-210.zip
unzip liblinear-210.zip && mv liblinear-210 liblinear
cd liblinear && make && cd ../
#5: Train and Upload Model
./train_and_upload_demo_model.py -c config.json
#6: Re-run Searches using Machine-learned Ranking Model
http://localhost:8983/solr/techproducts/browse?q=ipod
&rq={!ltr model=exampleModel reRankDocs=25 efi.user_query=$q}
# Run Searches
http://localhost:8983/solr/techproducts/select?q=ipod
# Supply User Relevancy Judgements
nano contrib/ltr/example/user_queries.txt
#Format: query | doc id | relevancy judgement | source
# Train and Upload Model
./train_and_upload_demo_model.py -c config.json
# Re-run Searches using Machine-learned Ranking Model
http://localhost:8984/solr/techproducts/browse?q=ipod
&rq={!ltr model=exampleModel reRankDocs=100 efi.user_query=$q}
The Relevancy Spectrum
• Traditional Keyword Search
• Recommendations
• Semantic Search
• User Intent
• Personalized Search
• Augmented Search
• Domain-aware Matching
Streaming Expressions & Graph Traversals
Streaming Expressions
• Perform relational operations on streams
• Stream sources: search, jdbc, facets, features, gatherNodes, shortestPath,
train, model, random, stats, topic
• Stream decorators: classify, commit, complement, daemon, executor, fetch,
having, leftOuterJoin, hashJoin, innerJoin, intersect, merge, null,
outerHashJoin, parallel, priority, reduce, rollup, scoreNodes, select,
sort, top, unique, update
Streaming Expressions - Examples
• Shortest-path Graph Traversal
• Parallel Batch Processing
• Train a Logistic Regression Model
• Distributed Joins
• Rapid Export of all Search Results
• Pull Results from External Database
• Classifying Search Results
Sources: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
Graph Use Cases
• Anomaly detection /
fraud detection
• Recommenders
• Social network analysis
• Graph Search
• Access Control
• Relationship discovery / scoring
Examples
o Find all draft blog posts about “Parallel SQL”
written by a developer
o Find all tweets mentioning “Solr” by me or people
I follow
o Find 3-star hotels in NYC my friends stayed in
last year
Solr Graph Timeline
• Some data is much more naturally represented as a graph
structure
• Solr 6.0: Introduced the Graph Query Parser
• Solr 6.1: Introduced Graph Streaming expressions
…
• Solr 6.6: Current Version
• TBD: Semantic Knowledge Graph (patch available)
Graph Query Parser
• Query-time, cycle-aware graph traversal that can rank documents based on relationships
• Provides controls for depth, filtering of results, and inclusion
of root and/or leaves
• Limitations: single node/shard only
Examples:
• http://localhost:8983/solr/graph/query?fl=id,score&
q={!graph from=in_edge to=out_edge}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&
q={!graph from=in_edge to=out_edge
traversalFilter='foo:[* TO 15]'}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&
q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10]
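What the graph query parser does conceptually can be sketched as a cycle-aware, breadth-first traversal over documents. In this model a hop goes from a document's `out_edge` values to every document whose `in_edge` field contains one of them. The function name, the in-memory document layout, and the callable `traversal_filter` are assumptions for the example, not Solr's implementation.

```python
def graph_query(docs, root_ids, max_depth=-1, traversal_filter=None):
    """Return the ids reachable from `root_ids`, roots included.

    `docs` maps doc id -> {"in_edge": [...], "out_edge": [...]}.
    `max_depth=-1` means unlimited depth; `traversal_filter` is an optional
    predicate a document must satisfy to be traversed to.
    """
    # Index documents by the values of their incoming-edge field.
    by_in_edge = {}
    for doc_id, doc in docs.items():
        for edge in doc.get("in_edge", []):
            by_in_edge.setdefault(edge, set()).add(doc_id)

    visited = set(root_ids)
    frontier = set(root_ids)
    depth = 0
    while frontier and (max_depth < 0 or depth < max_depth):
        next_frontier = set()
        for doc_id in frontier:
            for edge in docs[doc_id].get("out_edge", []):
                for target in by_in_edge.get(edge, ()):
                    if target in visited:
                        continue  # cycle-aware: never revisit a document
                    if traversal_filter and not traversal_filter(docs[target]):
                        continue
                    next_frontier.add(target)
        visited |= next_frontier
        frontier = next_frontier
        depth += 1
    return visited


# Hypothetical graph: A -> B and D, B -> C, and C points back to B (a cycle).
docs = {
    "A": {"out_edge": ["1", "9"]},
    "B": {"in_edge": ["1"], "out_edge": ["3"]},
    "C": {"in_edge": ["3"], "out_edge": ["1"]},
    "D": {"in_edge": ["9"]},
}
print(sorted(graph_query(docs, {"A"})))
print(sorted(graph_query(docs, {"A"}, max_depth=1)))
```

The `visited` set is what makes the traversal terminate on cyclic data, and `max_depth` and `traversal_filter` play the roles of the depth and filtering controls listed above.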
Graph Streaming Expressions
• Part of Solr’s broader Streaming Expressions capability
• Implements a powerful, breadth-first traversal
• Works across shards AND collections
• Supports aggregations
• Cycle aware
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" \
  -d 'expr=…' "http://localhost:18984/solr/movielens/stream"
All movies that user 389 watched
expr:gatherNodes(movielens,walk="389->user_id_i",gather="movie_id_i")
All movies that viewers of a specific movie watched
expr:gatherNodes(movielens,
gatherNodes(movielens,walk="161->movie_id_i",gather="user_id_i"),
walk="node->user_id_i",gather="movie_id_i", trackTraversal="true"
)
Movie 161: “The Air Up There”
Collaborative Filtering
expr=top(n="5", sort="count(*) desc",
  gatherNodes(movielens,
    top(n="30", sort="count(*) desc",
      gatherNodes(movielens,
        search(movielens, q="user_id_i:305", fl="movie_id_i",
               sort="movie_id_i asc", qt="/export"),
        walk="movie_id_i->movie_id_i", gather="user_id_i",
        maxDocFreq="10000", count(*)
      )
    ),
    walk="node->user_id_i", gather="movie_id_i", count(*)
  )
)
Comparing Graph Choices

                              | Solr                                                  | Elastic Graph                                             | Neo4J                                             | Spark GraphX
Best Use Case                 | QParser: predef. relationships as filters;            | Limited to sequential, term relatedness exploration only | Graph ops and querying that fit on a single node | Large-scale, iterative graph ops
                              | Expressions: fast, query-based, dist. graph ops       |                                                           |                                                   |
Common Graph Algorithms       | Partial                                               | No                                                        | Yes                                               | Yes
(e.g. Pregel, Traversal)      |                                                       |                                                           |                                                   |
Scaling                       | QParser: Co-located Shards only; Expressions: Yes     | Yes                                                       | Master/Replica                                    | Yes
Commercial License Required   | No                                                    | Yes                                                       | GPLv3                                             | No
Visualizations                | GraphML support (e.g. Gephi)                          | Kibana                                                    | Neo4j browser                                     | 3rd party
Data-driven App Sophistication
• Basic Keyword Search (inverted index, tf-idf, bm25, query formulation, etc.)
• Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)
• Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification)
• Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks)
• Self-learning
Contact Info
Trey Grainger
trey.grainger@lucidworks.com
@treygrainger
http://solrinaction.com
Meetup discount (39% off): 39grainger
Other presentations:
http://www.treygrainger.com
Audience Questions
#1: How can you figure out the meaning or intent of
keywords, particularly when there are multiple ways to
represent them or multiple meanings?
How do we handle phrases with ambiguous meanings?
Example     Related Keywords (representing multiple meanings)
driver      truck driver, linux, windows, courier, embedded, cdl, delivery
architect   autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer
…           …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
A few methodologies:
1) Query Log Mining
2) Semantic Knowledge Graph
Query Log Mining: Discovering ambiguous phrases
1) Classify users who ran each search in the search logs
(i.e. by the job title classifications of the jobs to which they applied)
2) Create a probabilistic graphical model of those classifications mapped
to each keyword phrase.
3) Segment the search term => related search terms list by classification,
to return a separate related terms list per classification
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
Semantic Knowledge Graph: Discovering ambiguous phrases
1) Exact same concept, but use
a document classification
field (i.e. category) as the first
level of your graph, and the
related terms as the second
level to which you traverse.
2) Has the benefit that you don’t need query logs to mine, but it will be representative
of your data, as opposed to your users’ intent, so the quality depends on how clean and
representative your documents are.
Disambiguated meanings (represented as term vectors)
Example     Related Keywords (Disambiguated Meanings)
architect   1: enterprise architect, java architect, data architect, oracle, java, .net
            2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer
driver      1: linux, windows, embedded
            2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
designer    1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video
            2: graphic, web designer, design, web design, graphic design, graphic designer
            3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit
…           …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
Using the disambiguated meanings
In a situation where a user searches for an ambiguous phrase, what information can we
use to pick the correct underlying meaning?
1. Any pre-existing knowledge about the user:
• User is a software engineer
• User has previously run searches for “c++” and “linux”
2. Context within the query:
User searched for windows AND driver vs. courier OR driver
3. If all else fails (and there is no context), use the most commonly occurring meaning.
driver 1: linux, windows, embedded
2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
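The three fallbacks above can be sketched as a simple overlap test between the context terms (from the user profile or the rest of the query) and each disambiguated meaning. This is an illustration, not the method from the cited paper: the `pick_sense` function is hypothetical, and treating the first listed meaning as the most common one is an assumption made for the example.

```python
# Disambiguated meanings for "driver", copied from the slide.
SENSES = {
    "driver": [
        ["linux", "windows", "embedded"],
        ["truck driver", "cdl driver", "delivery driver", "class b driver", "cdl", "courier"],
    ],
}


def pick_sense(keyword, context_terms):
    """Pick the meaning whose related keywords overlap most with the context.

    Context terms may come from what is known about the user (past searches,
    job title) or from the other terms in the query. With no overlap at all,
    the first meaning is returned (assumed here to be the most common one).
    """
    senses = SENSES.get(keyword)
    if not senses:
        return None
    context = {term.lower() for term in context_terms}
    overlaps = [len(context & set(sense)) for sense in senses]
    # max() returns the first index on ties, which gives the fallback behaviour.
    best = max(range(len(senses)), key=lambda i: overlaps[i])
    return senses[best]


print(pick_sense("driver", ["windows"]))   # query: windows AND driver
print(pick_sense("driver", ["courier"]))   # query: courier OR driver
print(pick_sense("driver", []))            # no context available
```

The chosen meaning's keywords could then be used to expand or boost the original query.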
Audience Questions
#2: Can you tell me more about the semantic knowledge
graph?
See:
http://www.treygrainger.com/posts/presentations/the-semantic-knowledge-graph/

 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
 
The Next Generation of AI-powered Search
The Next Generation of AI-powered SearchThe Next Generation of AI-powered Search
The Next Generation of AI-powered Search
 
Building a Real-time Solr-powered Recommendation Engine
Building a Real-time Solr-powered Recommendation EngineBuilding a Real-time Solr-powered Recommendation Engine
Building a Real-time Solr-powered Recommendation Engine
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)
 
Vespa, A Tour
Vespa, A TourVespa, A Tour
Vespa, A Tour
 
Anyone Can Build A Recommendation Engine With Solr: Presented by Doug Turnbul...
Anyone Can Build A Recommendation Engine With Solr: Presented by Doug Turnbul...Anyone Can Build A Recommendation Engine With Solr: Presented by Doug Turnbul...
Anyone Can Build A Recommendation Engine With Solr: Presented by Doug Turnbul...
 
Building a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engineBuilding a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engine
 
Measuring Relevance in the Negative Space
Measuring Relevance in the Negative SpaceMeasuring Relevance in the Negative Space
Measuring Relevance in the Negative Space
 
Click-through relevance ranking in solr &  lucid works enterprise - By Andrz...
 Click-through relevance ranking in solr &  lucid works enterprise - By Andrz... Click-through relevance ranking in solr &  lucid works enterprise - By Andrz...
Click-through relevance ranking in solr &  lucid works enterprise - By Andrz...
 
Building a real time big data analytics platform with solr
Building a real time big data analytics platform with solrBuilding a real time big data analytics platform with solr
Building a real time big data analytics platform with solr
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge Graph
 
Solr 6.0 Graph Query Overview
Solr 6.0 Graph Query OverviewSolr 6.0 Graph Query Overview
Solr 6.0 Graph Query Overview
 

Similar to Building Search & Recommendation Engines

Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrTrey Grainger
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAsad Abbas
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesRahul Jain
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6DEEPAK KHETAWAT
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016Sumo Logic
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr WorkshopJSGB
 
Webinar: Lucidworks + Thomson Reuters for Improved Investment Performance
Webinar: Lucidworks + Thomson Reuters for Improved Investment PerformanceWebinar: Lucidworks + Thomson Reuters for Improved Investment Performance
Webinar: Lucidworks + Thomson Reuters for Improved Investment PerformanceLucidworks
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Lucidworks
 
Sumo Logic QuickStart
Sumo Logic QuickStartSumo Logic QuickStart
Sumo Logic QuickStartSumo Logic
 
Sumo Logic QuickStart - May 2016
Sumo Logic QuickStart - May 2016Sumo Logic QuickStart - May 2016
Sumo Logic QuickStart - May 2016Sumo Logic
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introductionotisg
 
Getting started with Splunk - Break out Session
Getting started with Splunk - Break out SessionGetting started with Splunk - Break out Session
Getting started with Splunk - Break out SessionGeorg Knon
 
Getting started with Splunk
Getting started with SplunkGetting started with Splunk
Getting started with SplunkSplunk
 
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Lucidworks
 

Similar to Building Search & Recommendation Engines (20)

Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Webinar: Lucidworks + Thomson Reuters for Improved Investment Performance
Webinar: Lucidworks + Thomson Reuters for Improved Investment PerformanceWebinar: Lucidworks + Thomson Reuters for Improved Investment Performance
Webinar: Lucidworks + Thomson Reuters for Improved Investment Performance
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
 
Sumo Logic QuickStart
Sumo Logic QuickStartSumo Logic QuickStart
Sumo Logic QuickStart
 
Sumo Logic QuickStart - May 2016
Sumo Logic QuickStart - May 2016Sumo Logic QuickStart - May 2016
Sumo Logic QuickStart - May 2016
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Getting started with Splunk - Break out Session
Getting started with Splunk - Break out SessionGetting started with Splunk - Break out Session
Getting started with Splunk - Break out Session
 
Getting started with Splunk
Getting started with SplunkGetting started with Splunk
Getting started with Splunk
 
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
 

More from Trey Grainger

Balancing the Dimensions of User Intent
Balancing the Dimensions of User IntentBalancing the Dimensions of User Intent
Balancing the Dimensions of User IntentTrey Grainger
 
Reflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationReflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationTrey Grainger
 
Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Trey Grainger
 
Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)Trey Grainger
 
AI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge ManagementAI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge ManagementTrey Grainger
 
The Future of Search and AI
The Future of Search and AIThe Future of Search and AI
The Future of Search and AITrey Grainger
 
Searching for Meaning
Searching for MeaningSearching for Meaning
Searching for MeaningTrey Grainger
 
The Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphThe Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphTrey Grainger
 
South Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis PanelSouth Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis PanelTrey Grainger
 

More from Trey Grainger (9)

Balancing the Dimensions of User Intent
Balancing the Dimensions of User IntentBalancing the Dimensions of User Intent
Balancing the Dimensions of User Intent
 
Reflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationReflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital Transformation
 
Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)
 
Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)
 
AI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge ManagementAI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge Management
 
The Future of Search and AI
The Future of Search and AIThe Future of Search and AI
The Future of Search and AI
 
Searching for Meaning
Searching for MeaningSearching for Meaning
Searching for Meaning
 
The Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphThe Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge Graph
 
South Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis PanelSouth Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis Panel
 

Recently uploaded

Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfWilly Marroquin (WillyDevNET)
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 

Recently uploaded (20)

Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 

Building Search & Recommendation Engines

  • 1. Building Search & Recommendation Engines Trey Grainger SVP of Engineering, Lucidworks Greenville Data Science 2017.06.29
  • 2. Trey Grainger SVP of Engineering • Previously Director of Engineering @ CareerBuilder • MBA, Management of Technology – Georgia Tech • BA, Computer Science, Business, & Philosophy – Furman University • Information Retrieval & Web Search - Stanford University Other fun projects: • Co-author of Solr in Action, plus numerous research papers • Frequent conference speaker • Founder of Celiaccess.com, the gluten-free search engine • Lucene/Solr contributor • Startup Investor / Advisor About Me
  • 4.
  • 7. “Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.”
  • 8. Key Solr Features: ● Multilingual Keyword search ● Relevancy Ranking of results ● Faceting & Analytics (nested / relational) ● Highlighting ● Spelling Correction ● Autocomplete/Type-ahead Prediction ● Sorting, Grouping, Deduplication ● Distributed, Fault-tolerant, Scalable ● Geospatial search ● Complex Function queries ● Recommendations (More Like This) ● Graph Queries and Traversals ● SQL Query Support ● Streaming Aggregations ● Batch and Streaming processing ● Highly Configurable / Plugins ● Learning to Rank ● Building machine-learning models ● … many more *source: Solr in Action, chapter 2
  • 9. The standard for enterprise search. 90% of Fortune 500 uses Solr.
  • 11.
  • 13. • Over 50 connectors to integrate all your data • Robust parsing framework to seamlessly ingest all your document types • Point-and-click indexing configuration and iterative simulation of results for full control over your ETL process • Your security model enforced end-to-end from ingest to search across your different data sources
  • 15. • Relevancy tuning: Point-and-click query pipeline configuration allows fine-grained control of results. • Machine-driven relevancy: Signals aggregation learns and automatically tunes relevancy and drives recommendations out of the box. • Powerful pipeline stages: Customize fields, stages, synonyms, boosts, facets, machine learning models, your own scripted behavior, and dozens of other powerful search stages. • Turnkey search UI (Lucidworks View): Build a sophisticated end-to-end search application in just hours.
  • 16. • Seamless integration of your entire search & analytics platform. • All capabilities exposed through secured APIs, so you can use our UI or build your own. • End-to-end security policies can be applied out of the box to every aspect of your search ecosystem. • Distributed, fault-tolerant scaling and supervision of your entire search application.
  • 17.
  • 18. • Modular library of UI components to create prototypes in hours, not weeks. • Fine-grained security for government and Fortune 500, military, law enforcement, enforcing permissions by item, role, geography or other parameters. • Stateless architecture so apps are robust, easy to deploy, and highly scalable. • Supports over 25 data platforms including Solr, SharePoint, Elasticsearch, Cloudera, Attivio, FAST, MongoDB, and many more - and of course Lucidworks Fusion. • Full library of visualization components for charts, pivots, graphs and more. • Pre-tested re-usable modules include pagination, faceting, geospatial mapping, rich snippets, heatmaps, topic pages, and more. Create custom search and discovery applications in minutes.
  • 19. Lucidworks Fusion Architecture [diagram: Apache Solr (shards); Apache ZooKeeper (leader election, load balancing, shared config management); Apache Spark (cluster manager, workers); core services (NLP, recommenders / signals, blob storage, pipelines, scheduling, alerting / messaging, connectors); REST API; Admin UI; Twigkit; data sources (logs, file, web, database, cloud); HDFS (optional); security built in]
  • 20. Fusion powers search for the brightest companies in the world.
  • 23. Basic Keyword Search The beginning of a typical search journey
  • 24. The inverted index
      What you SEND to Lucene/Solr:
        Document | Content Field
        doc1     | once upon a time, in a land far, far away
        doc2     | the cow jumped over the moon.
        doc3     | the quick brown fox jumped over the lazy dog.
        doc4     | the cat in the hat
        doc5     | The brown cow said “moo” once.
        …        | …
      How the content is INDEXED into Lucene/Solr (conceptually):
        Term  | Documents
        a     | doc1 [2x]
        brown | doc3 [1x], doc5 [1x]
        cat   | doc4 [1x]
        cow   | doc2 [1x], doc5 [1x]
        …     | …
        once  | doc1 [1x], doc5 [1x]
        over  | doc2 [1x], doc3 [1x]
        the   | doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
        …     | …
      Greenville Data Science & Analytics
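The mapping on this slide can be reproduced with a few lines of Python. This is only a conceptual sketch, not how Lucene stores its index: the `tokenize` helper (lowercase, split on non-alphanumerics) and the function names are assumptions made for illustration, standing in for a real Solr analysis chain.

```python
import re
from collections import defaultdict

# The five example documents from the slide.
DOCS = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": 'The brown cow said "moo" once.',
}


def tokenize(text):
    """Hypothetical stand-in for an analyzer: lowercase, keep alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


index = build_inverted_index(DOCS)
for term in ("a", "brown", "cat", "the"):
    print(term, index[term])
```

Looking up a term is then a single dictionary access, which is why matching is fast regardless of how many documents do not contain the term.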
  • 25. Matching queries to documents
      /solr/select/?q=apache solr
        Term   | Documents
        …      | …
        apache | doc1, doc3, doc4, doc5 …
        hadoop | doc2, doc4, doc6 …
        …      | …
        solr   | doc1, doc3, doc4, doc7, doc8 …
        …      | …
      [Venn diagram: apache only: doc5; apache and solr: doc1, doc3, doc4; solr only: doc7, doc8]
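The Venn diagram on this slide amounts to set operations over the postings lists. The sketch below uses the document IDs shown on the slide; the `match` function and its `require_all` flag are hypothetical names for illustration, not Solr API.

```python
# Postings lists copied from the slide (term -> documents containing it).
POSTINGS = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}


def match(query, postings, require_all=False):
    """Return documents matching any query term (OR) or every term (AND)."""
    term_sets = [postings.get(term, set()) for term in query.lower().split()]
    if not term_sets:
        return set()
    if require_all:
        return set.intersection(*term_sets)
    return set.union(*term_sets)


any_term = match("apache solr", POSTINGS)                     # whole Venn diagram
all_terms = match("apache solr", POSTINGS, require_all=True)  # the overlap only
print(sorted(any_term))
print(sorted(all_terms))
```

The overlap (doc1, doc3, doc4) contains both terms, so those documents would normally be expected to rank above the ones that match a single term.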
  • 26. Text Analysis Generating terms to index from raw text
  • 27. Text Analysis in Solr: A text field in Lucene/Solr has an Analyzer containing: ① Zero or more CharFilters (take incoming text and “clean it up” before it is tokenized) ② One Tokenizer (splits incoming text into a Token Stream containing zero or more Tokens) ③ Zero or more TokenFilters (examine and optionally modify each Token in the Token Stream) *From Solr in Action, Chapter 6
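The three-stage Analyzer structure (CharFilters, then one Tokenizer, then TokenFilters) can be mimicked in plain Python. This is a minimal sketch of the idea only: every function name here is hypothetical, the stopword list is made up, and none of it is Lucene or Solr API.

```python
import html
import re


def html_strip_char_filter(text):
    """CharFilter stand-in: drop tags and decode entities before tokenizing."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def whitespace_tokenizer(text):
    """Tokenizer stand-in: split the cleaned text into a list of tokens."""
    return text.split()


def lowercase_filter(tokens):
    """TokenFilter stand-in: normalize case."""
    return [token.lower() for token in tokens]


def stop_filter(tokens, stopwords=frozenset({"a", "an", "the", "in", "of"})):
    """TokenFilter stand-in: drop common words (illustrative stopword list)."""
    return [token for token in tokens if token not in stopwords]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run zero or more char filters, exactly one tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<b>The</b> Cat in the Hat",
    char_filters=[html_strip_char_filter],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase_filter, stop_filter],
)
print(tokens)  # ['cat', 'hat']
```

Order matters: the stop filter in this sketch only recognizes lowercase words, so it has to run after the lowercase filter.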
  • 31. Multi-lingual Text Analysis Analyzing text across multiple languages
  • 32. Example English Analysis Chains
      <!-- Chain 1: standard tokenizer, stopwords, lowercasing, possessives, Porter stemming -->
      <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.EnglishPossessiveFilterFactory"/>
          <filter class="solr.KeywordMarkerFilterFactory" protected="lang/en_protwords.txt"/>
          <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
      </fieldType>
      <!-- Chain 2: HTML stripping, whitespace tokenizer, synonyms, word delimiting, ASCII folding, KStem -->
      <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <charFilter class="solr.HTMLStripCharFilterFactory"/>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="lang/en_synonyms.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
          <filter class="solr.ASCIIFoldingFilterFactory"/>
          <filter class="solr.KStemFilterFactory"/>
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
      </fieldType>
  • 33. Per-language Analysis Chains *Some of the 32 different language configurations in Appendix B of Solr in Action
  • 34. Per-language Analysis Chains *Some of the 32 different language configurations in Appendix B of Solr in Action
  • 35. Which Stemmer do I choose? *From Solr in Action, Chapter 14 Greenville Data Science & Analytics
  • 36. Common English Stemmers Greenville Data Science & Analytics
  • 38. Relevancy Ranking Scoring the results, returning the best matches
  • 39. Classic Lucene/Solr Relevancy Algorithm:
    $$\mathrm{Score}(q, d) = \Big( \sum_{t \in q} \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t, d) \Big) \cdot \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q)$$
    Where: t = term; d = document; q = query; f = field
    $$\mathrm{tf}(t \in d) = \sqrt{\text{numTermOccurrencesInDocument}}$$
    $$\mathrm{idf}(t) = 1 + \log\big(\text{numDocs} / (\text{docFreq} + 1)\big)$$
    $$\mathrm{coord}(q, d) = \text{numTermsInDocumentFromQuery} / \text{numTermsInQuery}$$
    $$\mathrm{queryNorm}(q) = 1 / \sqrt{\text{sumOfSquaredWeights}}$$
    $$\text{sumOfSquaredWeights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2$$
    $$\mathrm{norm}(t, d) = d.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(f) \cdot f.\mathrm{getBoost}()$$
    *Source: Solr in Action, chapter 3
  • 41. • Term Frequency: “How well a term describes a document?” – Measure: how often a term occurs per document • Inverse Document Frequency: “How important is a term overall?” – Measure: how rare the term is across all documents TF * IDF *Source: Solr in Action, chapter 3 Greenville Data Science & Analytics
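The TF and IDF measures combine into the classic score roughly as follows. This is a minimal sketch under stated assumptions: every boost is taken to be 1, the natural logarithm is used, and `lengthNorm` is taken as 1/sqrt(number of terms in the field). The function name and the toy statistics are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def classic_score(query_terms: List[str], doc_terms: List[str],
                  doc_freq: Dict[str, int], num_docs: int) -> float:
    """Classic TF-IDF score with all boosts assumed to be 1."""
    if not query_terms or not doc_terms:
        return 0.0
    counts = Counter(doc_terms)

    def idf(term: str) -> float:
        return 1.0 + math.log(num_docs / (doc_freq.get(term, 0) + 1))

    # queryNorm = 1 / sqrt(sum of squared idf weights), boosts assumed 1
    query_norm = 1.0 / math.sqrt(sum(idf(t) ** 2 for t in query_terms))
    # lengthNorm assumed to be 1 / sqrt(number of terms in the field)
    length_norm = 1.0 / math.sqrt(len(doc_terms))

    matched = [t for t in query_terms if counts[t] > 0]
    coord = len(matched) / len(query_terms)
    total = sum(math.sqrt(counts[t]) * idf(t) ** 2 * length_norm for t in matched)
    return total * coord * query_norm


# Toy collection statistics, made up for the example.
doc_freq = {"solr": 2, "search": 1}
query = ["solr", "search"]
score_both = classic_score(query, ["solr", "search", "engine"], doc_freq, 3)
score_one = classic_score(query, ["solr", "solr", "cooking"], doc_freq, 3)
score_none = classic_score(query, ["cooking"], doc_freq, 3)
```

The document matching both query terms wins even though the other repeats "solr", because `coord` rewards covering more of the query and the square root dampens repeated occurrences.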
  • 42. BM25 (Okapi “Best Match” 25th Iteration)
    $$\mathrm{Score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t \in d) \cdot (k + 1)}{\mathrm{tf}(t \in d) + k \cdot \big(1 - b + b \cdot |d| / \mathrm{avgdl}\big)}$$
    Where: t = term; d = document; q = query; i = index
    $$\mathrm{tf}(t \in d) = \sqrt{\text{numTermOccurrencesInDocument}}$$
    $$\mathrm{idf}(t) = 1 + \log\big(\text{numDocs} / (\text{docFreq} + 1)\big)$$
    $$|d| = \sum_{t \in d} 1$$
    $$\mathrm{avgdl} = \Big( \sum_{d \in i} |d| \Big) \Big/ \Big( \sum_{d \in i} 1 \Big)$$
    k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.
    b = Free parameter. Usually ~0.75. Increases impact of document normalization.
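A short sketch of the BM25 formula, using the tf and idf definitions exactly as the slide gives them. Standard Okapi BM25 formulations use the raw term count and a different idf, so treat those two lines as swappable assumptions; the function name and sample documents are hypothetical.

```python
import math
from typing import Dict, List


def bm25_score(query_terms: List[str], doc_terms: List[str],
               doc_freq: Dict[str, int], num_docs: int, avgdl: float,
               k: float = 1.2, b: float = 0.75) -> float:
    """BM25 as written on the slide; k and b are the free parameters."""
    doc_len = len(doc_terms)
    score = 0.0
    for term in set(query_terms):
        occurrences = doc_terms.count(term)
        if occurrences == 0:
            continue
        tf = math.sqrt(occurrences)  # slide's definition of tf
        idf = 1.0 + math.log(num_docs / (doc_freq.get(term, 0) + 1))  # slide's idf
        length_factor = 1 - b + b * doc_len / avgdl
        score += idf * (tf * (k + 1)) / (tf + k * length_factor)
    return score


doc_freq = {"solr": 2, "search": 1}
short_doc = ["solr", "search"]
long_doc = ["solr", "search"] + ["filler"] * 8
short_score = bm25_score(["solr", "search"], short_doc, doc_freq, 3, avgdl=6.0)
long_score = bm25_score(["solr", "search"], long_doc, doc_freq, 3, avgdl=6.0)

# Saturation: even 100 occurrences cannot exceed idf * (k + 1).
saturated = bm25_score(["solr"], ["solr"] * 100, doc_freq, 3, avgdl=6.0)
upper_bound = (1.0 + math.log(3 / 3)) * (1.2 + 1)
```

With identical term counts, the shorter document scores higher because `b` penalizes documents longer than the average length.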
  • 43. That’s great, but what about domain-specific knowledge? News search: popularity and freshness drive relevance. Restaurant search: geographical proximity and price range are critical. Ecommerce: likelihood of a purchase is key. Movie search: more popular titles are generally more relevant. Job search: category of job, salary range, and geographical proximity matter. TF * IDF of keywords can’t hold its own against good domain-specific relevance factors!
  • 44. Domain-specific relevancy calculation (News Website Example)
    News website:
    /select?
      fq=$myQuery&
      q=_query_:"{!func}scale(query($myQuery),0,100)" AND
        _query_:"{!func}div(100,map(geodist(),0,1,1))" AND
        _query_:"{!func}recip(rord(publicationDate),0,100,100)" AND
        _query_:"{!func}scale(popularity,0,100)"&
      myQuery="street festival"&
      sfield=location&
      pt=33.748,-84.391
    Each of the four function clauses is labeled 25% on the slide.
    *Example from chapter 16 of Solr in Action
  • 45. Fancy boosting functions (Restaurant Search Example)
    Distance (50%) + keywords (30%) + category (20%)
    q=_val_:"scale(mul(query($keywords),1),0,30)" AND
      _val_:"scale(sum($radiusInKm,mul(query($distance),-1)),0,50)" AND
      _val_:"scale(mul(query($category),1),0,20)"
    &keywords=filet mignon
    &radiusInKm=48.28
    &distance=_val_:"geodist(latitudelongitude.latlon_is,33.77402,-84.29659)"
    &category="fine dining"
    &fq={!cache=false v=$keywords}
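The weighting idea behind the restaurant query (scale each signal into its own point budget, then add) can be sketched outside Solr. This is an illustration only: `scale` here is a plain min-max rescale across the candidate set, and the restaurant records and field names are invented for the example.

```python
from typing import Dict, List, Tuple


def scale(values: List[float], lo: float, hi: float) -> List[float]:
    """Min-max rescale values into [lo, hi]; a constant input maps to lo."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [float(lo)] * len(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


def rank_restaurants(candidates: List[Dict], radius_km: float) -> List[Tuple[str, float]]:
    """Distance (50 points) + keywords (30 points) + category (20 points)."""
    keyword = scale([c["keyword_score"] for c in candidates], 0, 30)
    # Closer is better, so score the remaining distance inside the radius.
    closeness = scale([radius_km - c["distance_km"] for c in candidates], 0, 50)
    category = scale([c["category_score"] for c in candidates], 0, 20)
    totals = [k + d + g for k, d, g in zip(keyword, closeness, category)]
    ranked = sorted(zip(candidates, totals), key=lambda pair: pair[1], reverse=True)
    return [(c["name"], round(total, 2)) for c, total in ranked]


# Hypothetical candidates; the raw scores are made up.
restaurants = [
    {"name": "A", "keyword_score": 2.0, "distance_km": 40.0, "category_score": 1.0},
    {"name": "B", "keyword_score": 1.0, "distance_km": 5.0, "category_score": 1.0},
    {"name": "C", "keyword_score": 0.5, "distance_km": 20.0, "category_score": 0.0},
]
ranked = rank_restaurants(restaurants, radius_km=48.28)
```

Restaurant B wins despite a weaker keyword match because distance carries half of the total weight.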
  • 46. This is powerful, but feels like a lot of work to get right…
  • 47. what is “reflected intelligence”?
  • 48. Reflected Intelligence: “Leveraging previous data and interactions to improve how new data and interactions should be interpreted.” The Three C’s: Content: keywords and other features in your documents. Collaboration: how others have chosen to interact with your system. Context: available information about your users and their intent.
  • 49. Examples of Reflected Intelligence ● Recommendation Algorithms ● Building user profiles from past searches, clicks, and other actions ● Identifying correlations between keywords/phrases ● Building out automatically-generated ontologies from content and queries ● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs ● Learning to Rank - using relevancy judgements and machine learning to train a relevance model ● Discovering misspellings, synonyms, acronyms, and related keywords ● Disambiguation of keyword phrases with multiple meanings ● Learning what’s important in your content
  • 50. Consider what you know about users: John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development. Irene is a bartender in Dublin and is only interested in jobs within 10 km of her location in the food service industry. Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job. Jane is a nurse educator in Boston seeking a salary between $40K and $60K. *Example from chapter 16 of Solr in Action
  • 51. Query for Jane (Jane is a nurse educator in Boston seeking between $40K and $60K)
    http://localhost:8983/solr/jobs/select/?
      fl=jobtitle,city,state,salary&
      q=( jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10 )
        AND ( (city:"Boston" AND state:"MA")^15 OR state:"MA" )
        AND _val_:"map(salary, 40000, 60000,10, 0)"
    *Example from chapter 16 of Solr in Action
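A query like Jane's can be generated from a stored user profile rather than written by hand. The helper below is a hypothetical sketch: the function name and profile fields are assumptions, the boosts are copied from the slide, and no escaping of Solr special characters is attempted.

```python
from urllib.parse import urlencode


def build_job_query(title: str, city: str, state: str,
                    min_salary: int, max_salary: int) -> str:
    """Assemble a boosted Solr query string from profile attributes."""
    title_clause = f'( jobtitle:"{title}"^25 OR jobtitle:({title})^10 )'
    location_clause = f'( (city:"{city}" AND state:"{state}")^15 OR state:"{state}" )'
    # map() yields 10 for salaries inside the range and 0 otherwise.
    salary_clause = f'_val_:"map(salary,{min_salary},{max_salary},10,0)"'
    return " AND ".join([title_clause, location_clause, salary_clause])


q = build_job_query("nurse educator", "Boston", "MA", 40000, 60000)
# Host and collection follow the slide's example URL.
url = "http://localhost:8983/solr/jobs/select?" + urlencode(
    {"fl": "jobtitle,city,state,salary", "q": q})
```

Using `urlencode` keeps the quotes, parentheses, and carets in the query from breaking the URL.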
  • 52. Search Results for Jane
    { ...
      "response":{"numFound":22,"start":0,"docs":[
        {"jobtitle":" Clinical Educator (New England/ Boston)", "city":"Boston", "state":"MA", "salary":41503},
        {"jobtitle":"Nurse Educator", "city":"Braintree", "state":"MA", "salary":56183},
        {"jobtitle":"Nurse Educator", "city":"Brighton", "state":"MA", "salary":71359},
        …]}}
    *Example documents available @ http://github.com/treygrainger/solr-in-action
  • 53. You just built a recommendation engine!
  • 54. Collaborative Filtering
    What you SEND to Lucene/Solr (Document: “Users who bought this product” field):
      doc1: user1, user4, user5
      doc2: user2, user3
      doc3: user4
      doc4: user4, user5
      doc5: user4, user1
      …
    How the content is INDEXED into Lucene/Solr (conceptually) (Term: Documents):
      user1: doc1, doc5
      user2: doc2
      user3: doc2
      user4: doc1, doc3, doc4, doc5
      user5: doc1, doc4
      …
  • 55. Step 1: Find similar users who like the same documents
    q=documentid:("doc1" OR "doc4")
    Document: “Users who bought this product” field
      doc1: user1, user4, user5
      doc2: user2, user3
      doc3: user4
      doc4: user4, user5
      doc5: user4, user1
      …
    Matched documents: doc1 (user1, user4, user5), doc4 (user4, user5)
    Top-scoring results (most similar users): 1) user4 (2 shared likes) 2) user5 (2 shared likes) 3) user1 (1 shared like)
    *Source: Solr in Action, chapter 16
  • 56. Step 2: Search for docs “liked” by those similar users
    /solr/select/?q=userlikes:("user4"^2 OR "user5"^2 OR "user1"^1)
    Most similar users: 1) user4 (2 shared likes) 2) user5 (2 shared likes) 3) user1 (1 shared like)
    Term: Documents
      user1: doc1, doc5
      user2: doc2
      user3: doc2
      user4: doc1, doc3, doc4, doc5
      user5: doc1, doc4
      …
    Top recommended documents: 1) doc1 (matches user4, user5, user1) 2) doc4 (matches user4, user5) 3) doc5 (matches user4, user1) 4) doc3 (matches user4) // doc2 does not match
    *Source: Solr in Action, chapter 16
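The two steps above can be reproduced in plain Python using the same toy data as the slides. This is a simplified sketch: a document's score is just the sum of the matching users' boosts, whereas Solr would apply its full relevance formula. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# The "Term: Documents" view from the slides: which docs each user liked.
likes: Dict[str, Set[str]] = {
    "user1": {"doc1", "doc5"},
    "user2": {"doc2"},
    "user3": {"doc2"},
    "user4": {"doc1", "doc3", "doc4", "doc5"},
    "user5": {"doc1", "doc4"},
}


def similar_users(liked_docs: Set[str], likes: Dict[str, Set[str]]) -> Counter:
    """Step 1: count how many of the given docs each user also liked."""
    shared = Counter()
    for user, docs in likes.items():
        overlap = len(docs & liked_docs)
        if overlap:
            shared[user] = overlap
    return shared


def recommend(liked_docs: Set[str], likes: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Step 2: score each doc by the summed weights of similar users who liked it."""
    weights = similar_users(liked_docs, likes)
    scores = Counter()
    for user, weight in weights.items():
        for doc in likes[user]:
            scores[doc] += weight
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


weights = similar_users({"doc1", "doc4"}, likes)
recommendations = recommend({"doc1", "doc4"}, likes)
```

As on the slide, documents the user already liked (doc1, doc4) still appear in the output; a real recommender would usually filter them out before display.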
  • 57. Using matrix factorization is typically more efficient (Ships with Fusion 3.1): Greenville Data Science & Analytics
  • 58. Feedback Loops User Searches User Sees Results User takes an action Users’ actions inform system improvements Greenville Data Science & Analytics
  • 62. • 200%+ increase in click-through rates • 91% lower TCO • 50,000 fewer support tickets • Increased customer satisfaction
  • 64. Learning to Rank (LTR) ● It applies machine learning techniques to discover the combination of features that provides the best ranking. ● It requires a labeled set of documents with relevancy scores for a given set of queries. ● Features used for ranking are usually more computationally expensive than the ones used for matching. ● It typically re-ranks a subset of the matched documents (e.g. top 1000).
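The re-ranking step in the last bullet can be sketched as follows. This assumes a simple linear model whose weights have already been learned; the feature names, weights, and documents are invented for illustration and do not come from the talk.

```python
from typing import Callable, Dict, List, Sequence


def rerank(results: List[Dict], weights: Sequence[float],
           features: Callable[[Dict], Sequence[float]], top_n: int) -> List[Dict]:
    """Re-score only the first top_n results with a linear model; keep the rest as-is."""
    head, tail = results[:top_n], results[top_n:]

    def model_score(doc: Dict) -> float:
        return sum(w * f for w, f in zip(weights, features(doc)))

    return sorted(head, key=model_score, reverse=True) + tail


def extract_features(doc: Dict) -> Sequence[float]:
    """Hypothetical features: popularity and whether the title matched the query."""
    return (doc["popularity"], doc["title_match"])


# Results in their original (first-pass) order.
first_pass = [
    {"id": "a", "popularity": 1, "title_match": 0},
    {"id": "b", "popularity": 5, "title_match": 1},
    {"id": "c", "popularity": 9, "title_match": 1},
]
reranked = rerank(first_pass, weights=(0.1, 2.0), features=extract_features, top_n=2)
```

Document "c" would score highest under the model but stays in third place because it falls outside the re-ranked window, which is the trade-off made to keep expensive features affordable.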
  • 66. Common LTR Algorithms • RankNet* (neural networks, boosted trees) • LambdaMart* (regression trees) • SVM Rank** (SVM classifier) ** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf * http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf Greenville Data Science & Analytics
  • 67. LambdaMart Example Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016 Greenville Data Science & Analytics
  • 69. #1: Pull, Build, Start Solr
      git clone https://github.com/apache/lucene-solr.git && cd lucene-solr/solr
      ant server
      bin/solr -e techproducts -Dsolr.ltr.enabled=true
    #2: Run Searches
      http://localhost:8983/solr/techproducts/browse?q=ipod
    #3: Supply User Relevancy Judgements
      cd contrib/ltr/example/
      nano user_queries.txt
    #4: Install Training Library
      curl -L https://github.com/cjlin1/liblinear/archive/v210.zip > liblinear-2.1.tar.gz
      tar -xf liblinear-2.1.tar.gz && mv liblinear-210 liblinear
      cd liblinear && make && cd ../
    #5: Train and Upload Model
      ./train_and_upload_demo_model.py -c config.json
    #6: Re-run Searches using Machine-learned Ranking Model
      http://localhost:8983/solr/techproducts/browse?q=ipod&rq={!ltr model=exampleModel reRankDocs=25 efi.user_query=$q}
  • 71. # Supply User Relevancy Judgements
      nano contrib/ltr/example/user_queries.txt
      #Format: query | doc id | relevancy judgement | source
    # Train and Upload Model
      ./train_and_upload_demo_model.py -c config.json
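A judgement file in the pipe-delimited format above is easy to load for inspection before training. This is a minimal sketch: the sample rows are invented, and treating the judgement as a float is an assumption about the file's contents.

```python
from typing import Iterable, List, NamedTuple


class Judgement(NamedTuple):
    query: str
    doc_id: str
    relevance: float  # assumed to be numeric
    source: str


def parse_judgements(lines: Iterable[str]) -> List[Judgement]:
    """Parse 'query | doc id | relevancy judgement | source' rows, skipping comments."""
    judgements = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        query, doc_id, relevance, source = (part.strip() for part in line.split("|"))
        judgements.append(Judgement(query, doc_id, float(relevance), source))
    return judgements


# Hypothetical rows; real files will have their own ids and sources.
sample = [
    "#Format: query | doc id | relevancy judgement | source",
    "ipod | doc-1 | 1.0 | hypothetical_click_log",
    "ipod | doc-2 | 0.0 | hypothetical_click_log",
    "",
]
parsed = parse_judgements(sample)
```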
  • 72. # Re-run Searches using Machine-learned Ranking Model http://localhost:8984/solr/techproducts/browse?q=ipod &rq={!ltr model=exampleModel reRankDocs=100 efi.user_query=$q}
  • 74. Streaming Expressions & Graph Traversals
  • 75. Streaming Expressions • Perform relational operations on streams • Stream sources: search, jdbc, facets, features, gatherNodes, shortestPath, train, model, random, stats, topic • Stream decorators: classify, commit, complement, daemon, executor, fetch, having, leftOuterJoin, hashJoin, innerJoin, intersect, merge, null, outerHashJoin, parallel, priority, reduce, rollup, scoreNodes, select, sort, top, unique, update
  • 76. Streaming Expressions - Examples: Shortest-path Graph Traversal; Parallel Batch Processing; Train a Logistic Regression Model; Distributed Joins; Rapid Export of all Search Results; Pull Results from External Database; Classifying Search Results. Sources: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
  • 77. Graph Use Cases • Anomaly detection / fraud detection • Recommenders • Social network analysis • Graph Search • Access Control • Relationship discovery / scoring Examples o Find all draft blog posts about “Parallel SQL” written by a developer o Find all tweets mentioning “Solr” by me or people I follow o Find 3-star hotels in NYC my friends stayed in last year
  • 78. Solr Graph Timeline • Some data is much more naturally represented as a graph structure • Solr 6.0: Introduced the Graph Query Parser • Solr 6.1: Introduced Graph Streaming expressions … • Solr 6.6: Current Version • TBD: Semantic Knowledge Graph (patch available) Greenville Data Science & Analytics
  • 79. Graph Query Parser • Query-time, cyclic aware graph traversal is able to rank documents based on relationships • Provides controls for depth, filtering of results and inclusion of root and/or leaves • Limitations: single node/shard only Examples: • http://localhost:8983/solr/graph/query?fl=id,score& q={!graph from=in_edge to=out_edge}id:A • http://localhost:8983/solr/my_graph/query?fl=id& q={!graph from=in_edge to=out_edge traversalFilter='foo:[* TO 15]'}id:A • http://localhost:8983/solr/my_graph/query?fl=id& q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10] Greenville Data Science & Analytics
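Conceptually, the `{!graph from=in_edge to=out_edge}` traversal is a cycle-aware breadth-first search: read the `to` field of the documents matched so far, then pull in documents whose `from` field holds one of those values. The sketch below illustrates that idea over an in-memory dictionary; it is not Solr's implementation, and the sample documents are invented.

```python
from typing import Dict, List, Set


def graph_query(docs: Dict[str, Dict[str, List[str]]], root_ids: Set[str],
                from_field: str = "in_edge", to_field: str = "out_edge",
                max_depth: int = -1) -> Set[str]:
    """Breadth-first traversal from the root docs; max_depth=-1 means unlimited."""
    visited = set(root_ids)
    frontier = set(root_ids)
    depth = 0
    while frontier and (max_depth < 0 or depth < max_depth):
        # Collect outgoing edge values from the current frontier.
        edge_values: Set[str] = set()
        for doc_id in frontier:
            edge_values.update(docs[doc_id].get(to_field, []))
        # Follow them to unvisited docs whose incoming-edge field matches.
        frontier = {
            doc_id for doc_id, doc in docs.items()
            if doc_id not in visited and edge_values & set(doc.get(from_field, []))
        }
        visited |= frontier
        depth += 1
    return visited


# Hypothetical graph: A -> B, A -> C, B -> D, D -> A (a cycle); E is unreachable.
docs = {
    "A": {"in_edge": ["4"], "out_edge": ["1", "2"]},
    "B": {"in_edge": ["1"], "out_edge": ["3"]},
    "C": {"in_edge": ["2"], "out_edge": []},
    "D": {"in_edge": ["3"], "out_edge": ["4"]},
    "E": {"in_edge": ["9"], "out_edge": []},
}
reachable = graph_query(docs, {"A"})
one_hop = graph_query(docs, {"A"}, max_depth=1)
```

The `visited` set is what makes the traversal cycle-aware: the edge from D back to A is followed but adds nothing new, so the loop terminates.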
  • 80. Graph Streaming Expressions • Part of Solr’s broader Streaming Expressions capability • Implements a powerful, breadth-first traversal • Works across shards AND collections • Supports aggregations • Cycle aware
    curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'expr=…' "http://localhost:18984/solr/movielens/stream"
  • 81. All movies that user 389 watched expr:gatherNodes(movielens,walk="389->user_id_i",gather="movie_id_i") Greenville Data Science & Analytics
  • 82. All movies that viewers of a specific movie watched expr:gatherNodes(movielens, gatherNodes(movielens,walk="161->movie_id_i",gather="user_id_i"), walk="node->user_id_i",gather="movie_id_i", trackTraversal="true" ) Movie 161: “The Air Up There” Greenville Data Science & Analytics
  • 83. Collaborative Filtering
    expr=top(n="5", sort="count(*) desc",
           gatherNodes(movielens,
             top(n="30", sort="count(*) desc",
               gatherNodes(movielens,
                 search(movielens, q="user_id_i:305", fl="movie_id_i", sort="movie_id_i asc", qt="/export"),
                 walk="movie_id_i->movie_id_i", gather="user_id_i", maxDocFreq="10000", count(*))),
             walk="node->user_id_i", gather="movie_id_i", count(*)))
  • 84. Comparing Graph Choices
    Criterion | Solr | Elastic Graph | Neo4J | Spark GraphX
    Best Use Case | QParser: predef. relationships as filters; Expressions: fast, query-based, dist. graph ops | Limited to sequential, term relatedness exploration only | Graph ops and querying that fit on a single node | Large-scale, iterative graph ops
    Common Graph Algorithms (e.g. Pregel, Traversal) | Partial | No | Yes | Yes
    Scaling | QParser: Co-located Shards only; Expressions: Yes | Yes | Master/Replica | Yes
    Commercial License Required | No | Yes | GPLv3 | No
    Visualizations | GraphML support (e.g. Gephi) | Kibana | Neo4j browser | 3rd party
  • 85. Basic Keyword Search (inverted index, tf-idf, bm25, query formulation, etc.) Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.) Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification) Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks) Self-learning Data-driven App Sophistication Greenville Data Science & Analytics
  • 88. Contact Info Trey Grainger trey.grainger@lucidworks.com @treygrainger http://solrinaction.com Meetup discount (39% off): 39grainger Other presentations: http://www.treygrainger.com Greenville Data Science & Analytics
  • 89. Greenville Data Science & Analytics Audience Questions #1: How can you figure out the meaning or intent of keywords, particularly when there are multiple ways to represent them or multiple meanings?
  • 90. How do we handle phrases with ambiguous meanings? Example related keywords (representing multiple meanings):
    driver: truck driver, linux, windows, courier, embedded, cdl, delivery
    architect: autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer
    …
    Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  • 91. A few methodologies: 1) Query Log Mining 2) Semantic Knowledge Graph
  • 92. Query Log Mining: Discovering ambiguous phrases 1) Classify users who ran each search in the search logs (i.e. by the job title classifications of the jobs to which they applied). 2) Create a probabilistic graphical model of those classifications mapped to each keyword phrase. 3) Segment the search term => related search terms list by classification, to return a separate related terms list per classification. Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  • 93. Semantic Knowledge Graph: Discovering ambiguous phrases 1) Exact same concept, but use a document classification field (i.e. category) as the first level of your graph, and the related terms as the second level to which you traverse. 2) Has the benefit that you don’t need query logs to mine, but it will be representative of your data, as opposed to your users’ intent, so the quality depends on how clean and representative your documents are.
  • 94. Disambiguated meanings (represented as term vectors) Example Related Keywords (Disambiguated Meanings) architect 1: enterprise architect, java architect, data architect, oracle, java, .net 2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer driver 1: linux, windows, embedded 2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier designer 1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video 2: graphic, web designer, design, web design, graphic design, graphic designer 3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit … … Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015. Greenville Data Science & Analytics
  • 95. Using the disambiguated meanings In a situation where a user searches for an ambiguous phrase, what information can we use to pick the correct underlying meaning? 1. Any pre-existing knowledge about the user: • User is a software engineer • User has previously run searches for “c++” and “linux” 2. Context within the query: User searched for windows AND driver vs. courier OR driver 3. If all else fails (and there is no context), use the most commonly occurring meaning. driver 1: linux, windows, embedded 2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015. Greenville Data Science & Analytics
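The selection logic on this slide (user knowledge, then query context, then a default) can be sketched as a term-overlap check against each candidate meaning. This is a hypothetical simplification: the senses are copied from the slide, overlap counting stands in for whatever similarity measure a real system would use, and treating the first sense as the most common one is an assumption.

```python
from typing import Dict, List, Set

# Disambiguated meanings for "driver", taken from the slide.
SENSES: Dict[str, List[Set[str]]] = {
    "driver": [
        {"linux", "windows", "embedded"},
        {"truck driver", "cdl driver", "delivery driver",
         "class b driver", "cdl", "courier"},
    ],
}


def pick_sense(phrase: str, context_terms: Set[str],
               senses: Dict[str, List[Set[str]]],
               default_index: int = 0) -> Set[str]:
    """Pick the sense sharing the most terms with the context.

    context_terms may mix the user's past searches with other query terms.
    default_index marks the most common sense (assumed to be the first one here).
    """
    candidates = senses[phrase]
    overlaps = [len(sense & context_terms) for sense in candidates]
    best = max(overlaps)
    if best == 0:
        return candidates[default_index]  # no context: fall back to the default
    return candidates[overlaps.index(best)]


from_history = pick_sense("driver", {"c++", "linux"}, SENSES)
from_query = pick_sense("driver", {"courier"}, SENSES)
fallback = pick_sense("driver", set(), SENSES)
```

Passing the user's history and the current query terms through the same `context_terms` set keeps the sketch short; a production system would likely weight those sources differently.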
  • 96. Audience Questions #2: Can you tell me more about the semantic knowledge graph? See: http://www.treygrainger.com/posts/presentations/the-semantic-knowledge-graph/