The need for sophistication in modern search engine implementations

A Case for Sophistication
The modern requirements of search (Solr Focused)

Let’s get philosophical – for a moment.

George Gilder published a book in 1990 that made some fairly accurate predictions about the
future of computing.

If wealth is knowledge, then knowledge about
our domain, and the knowledge to model
it accurately, could be said to be “value”.
“Value” is the proposition that drives
users to engage with search.

In computer graphics, 3D models are simplified for
real-time applications (video games).
Fidelity is preserved by applying a high-fidelity proxy to
the lower-fidelity “real-time” representation.
This process is called “baking”.

In machine learning, when we ‘train’ a model, we are
‘baking’ knowledge into a more efficient representation.
The same is true for how we might enhance searches
by using external datasets, query statistics, LTR, etc.
Modeling a high-fidelity representation of data into a
real-time, more efficient form is key to climbing the
ladder of search sophistication.

Representing domain knowledge within our search
platform so that it provides value to our users is how
we achieve sophistication.
This is perhaps the greatest challenge in building
search products.

Our premise
Intent IS accuracy, recall IS relevancy
This may be controversial; recall vs accuracy is the wrong juxtaposition.

Perhaps the best way to describe this relationship is:
- The fidelity of a domain model impacts recall
- Accuracy is linked to our domain model
- Relevancy is linked to accuracy
- Accuracy is best modeled by understanding intent
- Restrictive queries shouldn’t be presumed to be accurate.
Accuracy exists independent of the percent of documents matched.

If accuracy is the ultimate goal, and ‘recall’ is a part of accuracy, how do we
go about achieving this?

Intent
Disambiguation
Location
Title (Known Item)
Category
Conceptual
Modeling intent allows you to have a conversation with your user.

recall
entities
synonymy hierarchy
ontology
conceptual proximity
More sophisticated, higher-resolution primitives within the index offer the
opportunity for recall to be more useful and more accurate.

Modeling Knowledge
Let’s return to discussing sophistication. Earlier we claimed that knowledge is
what provides value. We also said that modeling knowledge is difficult.
Implementing maturity in our search platform is what allows us to model our domain
knowledge.

Observation: it is really hard for most organizations to climb the sophistication ladder
that was shown in the previous slide.

Out of the box
Scorer (default similarity)
Query Handler (Edismax)
Import Handlers
Analyzers / TokenFilters
Boost Functions

What we need
Query Classifiers
ML Models
Behavior Sampling / Ingestion
Identity Awareness
Secondary Data Sources (data connectors)
Alternative forms of storage (inverted index)
Integrations (Spark, Airflow, etc.)
Collections as “Containers” for behavior.

When we model our domain, we want to model “things” that we
can call “entities”.
Modeling entities in any domain can be extremely valuable.

Modeling Knowledge - Entities
1.) disambiguation for free

2.) fairly easy to generate candidates for any domain

3.) fairly well researched area of ML

4.) helps in the modeling of “conceptual” synonyms

5.) must be pruned by user feedback / behavior

6.) groundwork for higher-level, more sophisticated features.

OK, but why?

In the previous slide we saw that 40% of Target Corporation’s searches are low-information, and their
intended meaning is unknown. Without modeling your corpus (the content you are searching) you
won’t be able to reason about the behavior or relationship between searches, actions, and ultimately
intent.
It is extremely common for a good portion of searches (half) to not provide the necessary
information to give relevant term-based search results.
This is at the core of the case for sophistication. Term search simply can’t provide useful results for
a large number of the searches that your users are going to perform.

Modeling Knowledge – Truth Systems
entity      feature   value
plato       isA       philosopher
socrates    isA       philosopher
plato       knew      socrates
socrates    knew      plato
plato       isA       historical-figure
socrates    isA       historical-figure
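
One way to picture a truth system is as a small store of (entity, feature, value) triples. The sketch below is only an illustration of that idea in Python: the facts are the ones in the table, while the function names (build_index, values_of, entities_with) are hypothetical and not part of Solr or any library.

```python
from collections import defaultdict

# Facts copied from the table above, stored as (entity, feature, value) triples.
TRIPLES = [
    ("plato", "isA", "philosopher"),
    ("socrates", "isA", "philosopher"),
    ("plato", "knew", "socrates"),
    ("socrates", "knew", "plato"),
    ("plato", "isA", "historical-figure"),
    ("socrates", "isA", "historical-figure"),
]


def build_index(triples):
    """Index triples by (entity, feature) so lookups avoid a full scan."""
    index = defaultdict(set)
    for entity, feature, value in triples:
        index[(entity, feature)].add(value)
    return index


def values_of(index, entity, feature):
    """Return every value asserted for an entity/feature pair."""
    return index.get((entity, feature), set())


def entities_with(triples, feature, value):
    """Return all entities for which (feature, value) is asserted."""
    return {e for e, f, v in triples if f == feature and v == value}


INDEX = build_index(TRIPLES)
```

Asking entities_with(TRIPLES, "isA", "philosopher") returns both plato and socrates, which is the kind of explicit fact a term index cannot state on its own.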

Modeling Knowledge - Similarity
Socrates != Plato
- Related, but not the same
- One is not a subset of the other
- Found in many of the same documents
- Found in many of the same contexts
- This is where automatic similarity methods fall down a bit.
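
To make the point concrete, here is a minimal sketch of why co-occurrence alone struggles here. It uses Jaccard overlap of document sets as a stand-in for an automatic similarity method; the document ids, the threshold, and the same_as set are all made-up assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of document ids."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical postings: which documents mention each entity.
DOCS = {
    "socrates": {1, 2, 3, 4, 5},
    "plato": {2, 3, 4, 5, 6},
    "aristophanes": {5, 9},
}


def relation(a, b, docs, same_as=frozenset(), threshold=0.5):
    """Label a pair 'same', 'related', or 'unrelated'.

    Co-occurrence can only say that two entities are similar; the
    explicit same_as set (curated knowledge) is what separates
    identity from relatedness.
    """
    if a == b or frozenset((a, b)) in same_as:
        return "same"
    score = jaccard(docs[a], docs[b])
    return "related" if score >= threshold else "unrelated"
```

With this toy data Socrates and Plato overlap heavily, so a purely statistical method would be tempted to treat them as interchangeable. Only knowledge held outside the co-occurrence counts tells us they are related but not the same.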

Modeling Knowledge - Ontologies
Entities and ontologies can work together...
- when building ontologies there are different types of relationships.
- word2vec / phrase2vec and LSA cannot be used by themselves.
- ontologies can be pruned and reshaped by supervised learning.
- ontologies can be reshaped by feature-systems (truth systems).
- most useful ontologies are modeled for a specific feature (product titles).
- query classifier can choose between similarity features / models.

[Diagram: Corpus → Domain Model]

We can’t simply rely on our corpus to provide us with the information necessary to model
our domain. We must use auxiliary data sources.
Fortunately there are many open data sources in the world that we can use to augment
our understanding of our corpus.

[Diagram: The Internet (Wikipedia, StackExchange, Amazon, Merchant API) → Domain Model]

In the previous slides we saw entity mapping and grading of a job-search domain model. This
was accomplished by building candidate phrases and then pruning them using an SVM trained
from features from a known good data source with phrases and topics already labeled.
Also shown was a query classifier that takes a lazy or poorly constructed query, groups the
components of the query logically and expands part of the query based on information it knows
about the index and availability and relatedness of terms.
A model to classify queries can be built by understanding the relationship between search
entities, and the entities and information contained within a document.
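
The grouping and expansion step described above can be sketched roughly as follows. This is not the classifier from the slides, only a minimal illustration under stated assumptions: the entity dictionary, its types, and its related phrases are hypothetical, and grouping uses a simple greedy longest match rather than a trained model.

```python
# Hypothetical entity dictionary: phrase -> (type, related phrases).
ENTITIES = {
    "software engineer": ("title", ["software developer"]),
    "new york": ("location", ["nyc"]),
    "java": ("skill", []),
}
MAX_PHRASE_LEN = max(len(p.split()) for p in ENTITIES)


def classify(query):
    """Group query tokens into known entities using greedy longest match.

    Returns a list of (phrase, type, expansions). Tokens that match no
    entity are kept with the type 'term' and no expansions.
    """
    tokens = query.lower().split()
    groups, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in ENTITIES:
                kind, related = ENTITIES[phrase]
                groups.append((phrase, kind, list(related)))
                i += n
                break
        else:
            # No entity starts at this token; keep it as a plain term.
            groups.append((tokens[i], "term", []))
            i += 1
    return groups
```

Calling classify("Java software engineer new york remote") groups the six tokens into a skill, a title, a location, and one leftover term, each carrying the expansions a rewritten query could use.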

SHReC is a Java package implementing a hierarchical document clustering algorithm based on a
statistical co-occurrence measure called subsumption.
The algorithm is particularly suited to the problem of on-line "search results" clustering, requiring little
amounts of text data. - http://shrec.sourceforge.net/
[Diagram: Search → Action → Document]
SHReC along with an entity model can be used to prune, grade, and reorganize an ontology to better
understand the types and accuracy of relationships. Algorithms used to cluster behavior with search
terms are invaluable in modeling search intent and rewriting search queries.
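
For readers unfamiliar with subsumption, the sketch below shows the test as it is commonly described in the literature: term x subsumes term y when nearly every document containing y also contains x, but not the reverse. The 0.8 threshold and the toy postings are assumptions, and this is not SHReC's code; whether SHReC uses exactly this formulation is not confirmed here.

```python
def subsumes(x, y, postings, threshold=0.8):
    """Return True if term x subsumes term y.

    x subsumes y when most documents containing y also contain x
    (P(x|y) >= threshold) while x also occurs without y (P(y|x) < 1).
    """
    docs_x, docs_y = postings[x], postings[y]
    if not docs_x or not docs_y:
        return False
    both = len(docs_x & docs_y)
    p_x_given_y = both / len(docs_y)
    p_y_given_x = both / len(docs_x)
    return p_x_given_y >= threshold and p_y_given_x < 1.0


def hierarchy(postings, threshold=0.8):
    """Return (parent, child) pairs for every subsumption found."""
    terms = sorted(postings)
    return [
        (x, y)
        for x in terms
        for y in terms
        if x != y and subsumes(x, y, postings, threshold)
    ]


# Hypothetical postings: term -> ids of documents containing it.
POSTINGS = {
    "philosopher": {1, 2, 3, 4, 5, 6},
    "plato": {1, 2, 3},
    "socrates": {2, 3, 4},
}
```

On this toy data "philosopher" becomes the parent of both names, while neither name subsumes the other, which matches the earlier observation that one is not a subset of the other.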

How best to combine phrase
boosting, multi-term synonyms, term
position (proximity) and performance is
a frequent question within the
community.

Exact Phrase Matches → PhraseQuery / SpanQuery
Proximity of Terms → SpanQuery
Related Phrases → Payloads / Index Time Synonyms
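
As a rough illustration of the first two rows, phrase and proximity boosts can be requested through edismax parameters, which build phrase queries behind the scenes. The sketch below only assembles request parameters; the field name "title" and the boost values are hypothetical, and span queries proper would usually need a different query parser or custom code.

```python
from urllib.parse import urlencode


def phrase_boost_params(user_query, field="title", slop=2):
    """Build edismax parameters that reward phrase and proximity matches.

    `field` is a hypothetical field name; the boosts are illustrative.
    """
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": field,            # fields searched term by term
        "pf": f"{field}^10",    # boost docs matching the whole query as a phrase
        "pf2": f"{field}^5",    # boost docs matching adjacent word pairs
        "ps": str(slop),        # phrase slop used for the phrase boosts
    }


params = phrase_boost_params("senior java developer")
query_string = urlencode(params)
```

The resulting query_string can be appended to a collection's select URL. Tuning the boosts and slop against real query logs is where most of the effort goes.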

Currently in Solr there is no built-in way to represent related entities efficiently. Query rewriting or
expansion can be performed at query time, but not all relationships can be modeled at query time
due to the complexity of the resulting query.
Different classifications of synonyms within the index are an option, as is using payloads to
assign relatedness scores to a given entity.
All index-side synonym solutions are quite custom and are not easy to implement quickly.
Better tools are needed to correctly model graphs of terms or entities and to create rules for how and
when to rewrite search queries without using crude rule-based systems.
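
The payload option can be sketched as follows: related entities are written into a field as "term|score" tokens, the format expected by Lucene's delimited payload token filter. Everything else here is an assumption for illustration: the relatedness scores are made up, and the field name related_dpf presumes a schema that maps it to a delimited-payload float field type.

```python
def encode_payloads(related, delimiter="|"):
    """Encode {term: relatedness} as 'term|score' tokens for a payload field.

    Multi-word terms are joined with underscores because the field is
    assumed to be whitespace-tokenized.
    """
    tokens = []
    # Highest relatedness first; ties are broken alphabetically.
    for term, score in sorted(related.items(), key=lambda kv: (-kv[1], kv[0])):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"relatedness for {term!r} must be in [0, 1]")
        tokens.append(f"{term.replace(' ', '_')}{delimiter}{score:.2f}")
    return " ".join(tokens)


# Hypothetical relatedness scores for a document about Plato.
doc = {
    "id": "plato",
    "related_dpf": encode_payloads(
        {"socrates": 0.9, "philosopher": 0.8, "historical figure": 0.5}
    ),
}
```

At query time the stored score can then feed into ranking, so a match on a related entity counts for less than a match on the entity itself.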

Conclusion
- Modeling the world through language is hard.
- Modeling phrases and entities makes life a little easier.
- Phrases form the basis of relationships.
- Accuracy should be proportional to confidence.
