In this talk, you'll learn how to build your own search and recommendation engine based on the open source Apache Lucene/Solr project. We'll dive into some of the data science behind how search engines work, covering multi-lingual text analysis, natural language processing, relevancy ranking algorithms, knowledge graphs, reflected intelligence, collaborative filtering, and other machine learning techniques used to drive relevant results for free-text queries. We'll also demonstrate how to build a recommendation engine leveraging the same platform and techniques that power search for most of the world's top companies. You'll walk away from this presentation with the toolbox you need to go and implement your very own search-based product using your own data.
Building Search & Recommendation Engines
1. Building Search & Recommendation Engines
Trey Grainger
SVP of Engineering, Lucidworks
Greenville Data Science
2017.06.29
2. Trey Grainger
SVP of Engineering
• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search – Stanford University
Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor
• Startup Investor / Advisor
About Me
13. • Over 50 connectors to integrate all your data
• Robust parsing framework to seamlessly ingest all your document types
• Point-and-click indexing configuration and iterative simulation of results for full control over your ETL process
• Your security model enforced end-to-end, from ingest to search, across your different data sources
15. • Relevancy tuning: Point-and-click query pipeline configuration allows fine-grained control of results.
• Machine-driven relevancy: Signal aggregation learns relevancy automatically and drives recommendations out of the box.
• Powerful pipeline stages: Customize fields, stages, synonyms, boosts, facets, machine learning models, your own scripted behavior, and dozens of other powerful search stages.
• Turnkey search UI (Lucidworks View): Build a sophisticated end-to-end search application in just hours.
16. • Seamless integration of your entire search & analytics platform
• All capabilities exposed through secured APIs, so you can use our UI or build your own.
• End-to-end security policies can be applied out of the box to every aspect of your search ecosystem.
• Distributed, fault-tolerant scaling and supervision of your entire search application
18. Create custom search and discovery applications in minutes.
• Modular library of UI components to create prototypes in hours, not weeks.
• Fine-grained security for government, Fortune 500, military, and law enforcement, enforcing permissions by item, role, geography, or other parameters.
• Stateless architecture so apps are robust, easy to deploy, and highly scalable.
• Supports over 25 data platforms including Solr, SharePoint, Elasticsearch, Cloudera, Attivio, FAST, MongoDB, and many more – and of course Lucidworks Fusion.
• Full library of visualization components for charts, pivots, graphs, and more.
• Pre-tested, re-usable modules including pagination, faceting, geospatial mapping, rich snippets, heatmaps, topic pages, and more.
24. The inverted index

What you SEND to Lucene/Solr:

Document | Content Field
doc1 | once upon a time, in a land far, far away
doc2 | the cow jumped over the moon.
doc3 | the quick brown fox jumped over the lazy dog.
doc4 | the cat in the hat
doc5 | The brown cow said "moo" once.
… | …

How the content is INDEXED into Lucene/Solr (conceptually):

Term | Documents
a | doc1 [2x]
brown | doc3 [1x], doc5 [1x]
cat | doc4 [1x]
cow | doc2 [1x], doc5 [1x]
… | …
once | doc1 [1x], doc5 [1x]
over | doc2 [1x], doc3 [1x]
the | doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
… | …
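To make the mapping concrete, here is a minimal sketch of an inverted index in Python (a conceptual illustration only, not how Lucene actually stores its index), built from the documents above:

import re
from collections import defaultdict

docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": 'The brown cow said "moo" once.',
}

# term -> {doc_id -> occurrence count}
index = defaultdict(lambda: defaultdict(int))
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z]+", text.lower()):
        index[token][doc_id] += 1

print(dict(index["the"]))   # {'doc2': 2, 'doc3': 2, 'doc4': 2, 'doc5': 1}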
27. Text Analysis in Solr
A text field in Lucene/Solr has an Analyzer containing:
1) Zero or more CharFilters – takes incoming text and "cleans it up" before it is tokenized
2) One Tokenizer – splits incoming text into a Token Stream containing zero or more Tokens
3) Zero or more TokenFilters – examines and optionally modifies each Token in the Token Stream
*From Solr in Action, Chapter 6
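As a rough conceptual sketch of that pipeline in Python (not Lucene's actual API; the stop list and regexes are illustrative assumptions):

import re

def char_filter(text):
    # CharFilter: clean up raw text before tokenization (here: strip HTML tags)
    return re.sub(r"<[^>]+>", " ", text)

def tokenizer(text):
    # Tokenizer: split the cleaned text into a stream of tokens
    return re.findall(r"\w+", text)

STOPWORDS = {"a", "an", "and", "of", "the"}

def lowercase_filter(tokens):
    # TokenFilter: normalize case
    return [t.lower() for t in tokens]

def stopword_filter(tokens):
    # TokenFilter: drop very common terms
    return [t for t in tokens if t not in STOPWORDS]

def analyze(text):
    return stopword_filter(lowercase_filter(tokenizer(char_filter(text))))

print(analyze("<b>The Quick Brown Fox</b>"))   # ['quick', 'brown', 'fox']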
33. Per-language Analysis Chains
*Some of the 32 different language configurations in Appendix B of Solr in Action
35. Which Stemmer do I choose?
*From Solr in Action, Chapter 14
39. Classic Lucene/Solr Relevancy Algorithm:
*Source: Solr in Action, chapter 3

Score(q, d) = Σ_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) ) · coord(q, d) · queryNorm(q)

Where:
t = term; d = document; q = query; f = field
tf(t in d) = sqrt(numTermOccurrencesInDocument)
idf(t) = 1 + log( numDocs / (docFreq + 1) )
coord(q, d) = numTermsInDocumentFromQuery / numTermsInQuery
queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)
sumOfSquaredWeights = q.getBoost()² · Σ_{t in q} ( idf(t) · t.getBoost() )²
norm(t, d) = d.getBoost() · lengthNorm(f) · f.getBoost()
41. TF * IDF
• Term Frequency: "How well does a term describe a document?"
  – Measure: how often the term occurs in the document
• Inverse Document Frequency: "How important is the term overall?"
  – Measure: how rare the term is across all documents
*Source: Solr in Action, chapter 3
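A quick worked example in Python, plugging the toy corpus from the inverted-index slide into the classic formulas above:

import math

num_docs = 5
doc_freq = {"the": 4, "brown": 2}   # document frequencies from the inverted index above

def tf(occurrences):
    return math.sqrt(occurrences)

def idf(term):
    return 1 + math.log(num_docs / (doc_freq[term] + 1))

# "the" is common (low idf), "brown" is rarer (higher idf):
print(tf(2) * idf("the") ** 2)     # ~1.41 -- frequent, but a weak signal
print(tf(1) * idf("brown") ** 2)   # ~2.28 -- rare, so a stronger signal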
42. BM25 (Okapi "Best Match", 25th Iteration)

Score(q, d) = Σ_{t in q} idf(t) · ( tf(t in d) · (k + 1) ) / ( tf(t in d) + k · (1 – b + b · |d| / avgdl) )

Where:
t = term; d = document; q = query; i = index
tf(t in d) = sqrt(numTermOccurrencesInDocument)
idf(t) = 1 + log( numDocs / (docFreq + 1) )
|d| = Σ_{t in d} 1 (the document's length in terms)
avgdl = ( Σ_{d in i} |d| ) / ( Σ_{d in i} 1 ) (the average document length across the index)
k = free parameter, usually ~1.2 to 2.0; raises the term-frequency saturation point.
b = free parameter, usually ~0.75; increases the impact of document-length normalization.
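And a minimal sketch of that scoring function in Python (toy values; Lucene's production BM25 implementation differs in details such as norm encoding):

import math

def bm25_term_score(tf, df, num_docs, doc_len, avgdl, k=1.2, b=0.75):
    idf = 1 + math.log(num_docs / (df + 1))
    return idf * (tf * (k + 1)) / (tf + k * (1 - b + b * doc_len / avgdl))

# A rare term in a short document outscores a common term in a long one:
print(bm25_term_score(tf=2, df=3,   num_docs=1000, doc_len=50,  avgdl=200))  # ~11.4
print(bm25_term_score(tf=2, df=800, num_docs=1000, doc_len=400, avgdl=200))  # ~1.3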
43. That's great, but what about domain-specific knowledge?
News search: popularity and freshness drive relevance.
Restaurant search: geographical proximity and price range are critical.
Ecommerce: likelihood of a purchase is key.
Movie search: more popular titles are generally more relevant.
Job search: category of job, salary range, and geographical proximity matter.
TF * IDF of keywords can't hold its own against good domain-specific relevance factors!
44. Domain-specific relevancy calculation (News Website Example)
*Example from chapter 16 of Solr in Action

/select?
  fq=$myQuery&
  q=_query_:"{!func}scale(query($myQuery),0,100)"               (25% – keyword relevancy)
    AND _query_:"{!func}div(100,map(geodist(),0,1,1))"          (25% – geographic proximity)
    AND _query_:"{!func}recip(rord(publicationDate),0,100,100)" (25% – freshness)
    AND _query_:"{!func}scale(popularity,0,100)"                (25% – popularity)&
  myQuery="street festival"&
  sfield=location&
  pt=33.748,-84.391

Each of the four function queries is scaled to contribute up to 25% of the overall score.
48. The Three C's
Content: keywords and other features in your documents
Collaboration: how others have chosen to interact with your system
Context: available information about your users and their intent

Reflected Intelligence
"Leveraging previous data and interactions to improve how new data and interactions should be interpreted"
49. Examples of Reflected Intelligence
• Recommendation algorithms
• Building user profiles from past searches, clicks, and other actions
• Identifying correlations between keywords/phrases
• Building out automatically-generated ontologies from content and queries
• Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs
• Learning to Rank – using relevancy judgements and machine learning to train a relevance model
• Discovering misspellings, synonyms, acronyms, and related keywords
• Disambiguation of keyword phrases with multiple meanings
• Learning what's important in your content
50. Consider what you know about users
John lives in Boston but wants to move to New York, or possibly another big city. He is currently a sales manager but wants to move toward business development.
Irene is a bartender in Dublin and is only interested in food-service jobs within 10 km of her location.
Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job.
Jane is a nurse educator in Boston seeking between $40K and $60K.
*Example from chapter 16 of Solr in Action
64. Learning to Rank (LTR)
• Applies machine learning techniques to discover the combination of features that provides the best ranking
• Requires a labeled set of documents with relevancy scores for a given set of queries
• Features used for ranking are usually more computationally expensive than the ones used for matching
• Typically re-ranks a subset of the matched documents (e.g. top 1000)
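As a rough illustration, here is a pointwise sketch using scikit-learn (the feature names and toy judgements are hypothetical; Solr's LTR plugin expects a model trained externally, e.g. linear or LambdaMART, to be uploaded):

from sklearn.linear_model import LinearRegression

# Hypothetical per-(query, document) features:
# [bm25_title_score, bm25_body_score, popularity, freshness_days]
X = [[12.1, 8.3, 0.9,   2],
     [ 4.2, 6.1, 0.4,  30],
     [ 0.5, 1.2, 0.1, 365]]
y = [3, 1, 0]   # human relevancy judgements (3 = perfect ... 0 = irrelevant)

model = LinearRegression().fit(X, y)

# At query time, re-rank only the top N matched documents by predicted relevance:
print(model.predict([[10.0, 7.5, 0.8, 5]]))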
67. LambdaMART Example
Source: T. Grainger, K. AlJadda. "Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016.
69. #1: Pull, Build, Start Solr
git clone https://github.com/apache/lucene-solr.git && cd lucene-solr/solr
ant server
bin/solr -e techproducts -Dsolr.ltr.enabled=true
#2: Run Searches
http://localhost:8983/solr/techproducts/browse?q=ipod
#3: Supply User Relevancy Judgements
cd contrib/ltr/example/
nano user_queries.txt
#4: Install Training Library
curl -L https://github.com/cjlin1/liblinear/archive/v210.zip > liblinear-210.zip
unzip liblinear-210.zip && mv liblinear-210 liblinear
cd liblinear && make && cd ../
#5: Train and Upload Model
./train_and_upload_demo_model.py -c config.json
#6: Re-run Searches using Machine-learned Ranking Model
http://localhost:8983/solr/techproducts/browse?q=ipod
&rq={!ltr model=exampleModel reRankDocs=25 efi.user_query=$q}
76. Streaming Expressions – Examples
• Shortest-path graph traversal
• Parallel batch processing
• Training a logistic regression model
• Distributed joins
• Rapid export of all search results
• Pulling results from an external database
• Classifying search results
Sources: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
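For a flavor of the syntax, here is a sketch of submitting a shortest-path expression from Python (the 'friends' collection and its from_s/to_s edge fields are hypothetical; consult the Streaming Expressions docs for exact parameters):

import requests

# Walk the graph from "alice" to "bob" along from_s -> to_s edges
expr = '''shortestPath(friends,
                       from="alice",
                       to="bob",
                       edge="from_s=to_s",
                       maxDepth="4")'''

resp = requests.post("http://localhost:8983/solr/friends/stream",
                     data={"expr": expr})
print(resp.json())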
77. Graph Use Cases
• Anomaly detection / fraud detection
• Recommenders
• Social network analysis
• Graph search
• Access control
• Relationship discovery / scoring
Examples:
o Find all draft blog posts about "Parallel SQL" written by a developer
o Find all tweets mentioning "Solr" by me or people I follow
o Find 3-star hotels in NYC my friends stayed in last year
78. Solr Graph Timeline
• Some data is much more naturally represented as a graph structure
• Solr 6.0: Introduced the Graph Query Parser
• Solr 6.1: Introduced Graph Streaming Expressions
…
• Solr 6.6: Current version
• TBD: Semantic Knowledge Graph (patch available)
79. Graph Query Parser
• Query-time, cycle-aware graph traversal, able to rank documents based on relationships
• Provides controls for depth, filtering of results, and inclusion of root and/or leaf nodes
• Limitations: single node/shard only
Examples:
• http://localhost:8983/solr/graph/query?fl=id,score&q={!graph from=in_edge to=out_edge}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&q={!graph from=in_edge to=out_edge traversalFilter='foo:[* TO 15]'}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10]
80. Graph Streaming Expressions
• Part of Solr's broader Streaming Expressions capability
• Implements a powerful, breadth-first traversal
• Works across shards AND collections
• Supports aggregations
• Cycle aware
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d "expr=…" "http://localhost:18984/solr/movielens/stream"
81. All movies that user 389 watched
expr:gatherNodes(movielens,walk="389->user_id_i",gather="movie_id_i")
82. All movies that viewers of a specific movie watched
expr:gatherNodes(movielens,
gatherNodes(movielens,walk="161->movie_id_i",gather="user_id_i"),
walk="node->user_id_i",gather="movie_id_i", trackTraversal="true"
)
Movie 161: "The Air Up There"
84. Comparing Graph Choices

Best use case:
  Solr – QParser: predefined relationships as filters; Expressions: fast, query-based, distributed graph ops
  Elastic Graph – limited to sequential, term-relatedness exploration only
  Neo4j – graph ops and querying that fit on a single node
  Spark GraphX – large-scale, iterative graph ops
Common graph algorithms (e.g. Pregel, traversal):
  Solr – partial | Elastic Graph – no | Neo4j – yes | Spark GraphX – yes
Scaling:
  Solr – QParser: co-located shards only; Expressions: yes | Elastic Graph – yes | Neo4j – master/replica | Spark GraphX – yes
Commercial license required:
  Solr – no | Elastic Graph – yes | Neo4j – GPLv3 | Spark GraphX – no
Visualizations:
  Solr – GraphML support (e.g. Gephi) | Elastic Graph – Kibana | Neo4j – Neo4j browser | Spark GraphX – 3rd party
89. Audience Questions
#1: How can you figure out the meaning or intent of
keywords, particularly when there are multiple ways to
represent them or multiple meanings?
90. How do we handle phrases with ambiguous meanings?

Example | Related Keywords (representing multiple meanings)
driver | truck driver, linux, windows, courier, embedded, cdl, delivery
architect | autocad drafter, designer, enterprise architect, java architect, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer
… | …

Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
91. Knowledge Graph
A few methodologies:
1) Query Log Mining
2) Semantic Knowledge Graph
92. Query Log Mining: Discovering ambiguous phrases
1) Classify users who ran each search in the search logs (i.e. by the job title classifications of the jobs to which they applied).
2) Create a probabilistic graphical model of those classifications mapped to each keyword phrase.
3) Segment the search term => related search terms list by classification, to return a separate related terms list per classification.
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
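A simplified sketch of the idea in Python, with co-occurrence counting standing in for the paper's probabilistic graphical model (the log data and classifications are toy examples):

from collections import Counter, defaultdict

# (search phrase, classification of the user who searched) pairs mined from logs
log = [("driver", "transportation"), ("driver", "software"),
       ("driver", "transportation"), ("driver", "software")]
related = {"transportation": ["truck driver", "cdl", "delivery"],
           "software":       ["linux", "windows", "embedded"]}

by_phrase = defaultdict(Counter)
for phrase, classification in log:
    by_phrase[phrase][classification] += 1

# One related-terms list per discovered sense of "driver":
for sense, count in by_phrase["driver"].most_common():
    print(sense, related[sense])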
93. Semantic Knowledge Graph: Discovering ambiguous phrases
1) Exact same concept, but use a document classification field (i.e. category) as the first level of your graph, and the related terms as the second level to which you traverse.
2) Has the benefit that you don't need query logs to mine, but the results will be representative of your data rather than your users' intent, so the quality depends on how clean and representative your documents are.
94. Disambiguated meanings (represented as term vectors)

Example | Related Keywords (Disambiguated Meanings)
architect | 1: enterprise architect, java architect, data architect, oracle, java, .net
          | 2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer
driver | 1: linux, windows, embedded
       | 2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
designer | 1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video
         | 2: graphic, web designer, design, web design, graphic design, graphic designer
         | 3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit
… | …

Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
95. Using the disambiguated meanings
When a user searches for an ambiguous phrase, what information can we use to pick the correct underlying meaning?
1. Any pre-existing knowledge about the user:
   • User is a software engineer
   • User has previously run searches for "c++" and "linux"
2. Context within the query:
   • User searched for windows AND driver vs. courier OR driver
3. If all else fails (and there is no context), use the most commonly occurring meaning.
driver 1: linux, windows, embedded
       2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
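A toy sketch of step 2 in Python, picking the sense whose related terms overlap most with the rest of the query (term vectors abbreviated from the table above):

senses = {1: {"linux", "windows", "embedded"},
          2: {"truck driver", "cdl", "courier", "delivery"}}

def pick_sense(query_terms):
    # Most context overlap wins; ties fall back to sense 1 (the most common meaning)
    return max(senses, key=lambda s: len(senses[s] & query_terms))

print(pick_sense({"windows", "driver"}))   # 1 (the software sense)
print(pick_sense({"courier", "driver"}))   # 2 (the truck-driver sense)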
96. Audience Questions
#2: Can you tell me more about the semantic knowledge
graph?
See: http://www.treygrainger.com/posts/presentations/the-semantic-knowledge-graph/