A Case for Sophistication
The modern requirements of search (Solr Focused)
Let’s get philosophical – for a moment.
George Gilder published a book in 1990 that made some fairly accurate predictions about the
future of computing.
“Wealth is Knowledge”
If wealth is knowledge, then knowledge about
our domain, and the knowledge to model
it accurately, could be said to be “value”.
“Value” is the proposition that drives
users to engage with search.
In computer graphics, 3D models are simplified for real-time
applications (video games).
Fidelity is preserved by applying a high-fidelity proxy to
the lower-fidelity “real-time” representation.
This process is called “baking”.
In machine learning, when we ‘train’ a model, we are
‘baking’ knowledge into a more efficient representation.
The same is true for how we might enhance searches
by using external datasets, query statistics, LTR, etc.
Modeling a high-fidelity representation of data into a
real-time, more efficient form is key to climbing the
ladder of search sophistication.
Representing domain knowledge within our search
platform so that it provides value to our users is how
we achieve sophistication.
This is perhaps the greatest challenge in building
search products.
Our premise
Intent IS accuracy, recall IS relevancy
This may be controversial; recall vs accuracy is the wrong juxtaposition.
Our premise
Perhaps the best way this relationship can be described is:
- The fidelity of a domain model impacts recall
- Accuracy is linked to our domain model
- Relevancy is linked to accuracy
- Accuracy is best modeled by understanding intent
- Restrictive queries shouldn’t be presumed to be accurate.
Accuracy exists independent of the percent of documents matched.
Our premise
If accuracy is the ultimate goal, and ‘recall’ is a part of accuracy, how do we
go about achieving this?
Our premise
Intent
- Disambiguation
- Location
- Title (Known Item)
- Category
- Conceptual
Modeling intent allows you to have a conversation with your user.
Our premise
Recall
- entities
- synonymy hierarchy
- ontology
- conceptual proximity
More sophisticated, higher-resolution primitives within the index offer the
opportunity for recall to be more useful and more accurate.
Modeling Knowledge
Let’s return to discussing sophistication. Earlier we claimed that knowledge is
what provides value. We also said that modeling knowledge is difficult.
Implementing maturity in our search platform is what allows us to model our domain
knowledge.
Modeling Knowledge
Modeling Knowledge
Observation… It’s really hard for most organizations to climb the sophistication ladder
that was shown in the previous slide.
Out of the box
- Scorer (default similarity)
- Query Handler (Edismax)
- Import Handlers
- Analyzers / TokenFilters
- Boost Functions
What we need
- Query Classifiers
- ML Models
- Behavior Sampling / Ingestion
- Identity Awareness
- Secondary Data Sources (data connectors)
- Alternative forms of storage (inverted index)
- Integrations (Spark, Airflow, etc.)
- Collections as “Containers” for behavior
Modeling Knowledge
When we model our domain, we want to model “things” that we
can call “entities”.
Modeling entities in any domain can be extremely valuable.
Modeling Knowledge - Entities
1.) disambiguation for free
2.) fairly easy to generate candidates for any domain
3.) fairly well researched area of ML
4.) helps in the modeling of “conceptual” synonyms
5.) must be pruned by user feedback / behavior
6.) groundwork for higher-level, more sophisticated features.
Modeling Knowledge - Entities
OK, but why?
In the previous slide we saw that 40% of Target Corporation’s searches are low-information, and that
Target doesn’t know what they mean. Without modeling your corpus (the content you are searching) you
won’t be able to reason about the behavior or relationship between searches, actions, and ultimately
intent.
It is extremely common for a good portion of searches (half) not to provide the information
needed to return relevant term-based search results.
This is at the core of the case for sophistication. Term search simply can’t provide useful results for
a large number of searches that your users are going to perform.
Modeling Knowledge – Truth Systems
entity     feature   value
plato      isA       philosopher
socrates   isA       philosopher
plato      knew      socrates
socrates   knew      plato
plato      isA       historical-figure
socrates   isA       historical-figure
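One way to hold such facts is a small in-memory triple store. The sketch below is only an illustration under stated assumptions: the `TruthSystem` class and its method names are hypothetical, and only the six triples come from the table above.

```python
from collections import defaultdict

# Triples copied from the table above: (entity, feature, value).
TRIPLES = [
    ("plato", "isA", "philosopher"),
    ("socrates", "isA", "philosopher"),
    ("plato", "knew", "socrates"),
    ("socrates", "knew", "plato"),
    ("plato", "isA", "historical-figure"),
    ("socrates", "isA", "historical-figure"),
]


class TruthSystem:
    """Hypothetical in-memory store for (entity, feature, value) facts."""

    def __init__(self, triples):
        # Map (entity, feature) -> set of values.
        self._facts = defaultdict(set)
        for entity, feature, value in triples:
            self._facts[(entity, feature)].add(value)

    def values(self, entity, feature):
        """All values recorded for an entity and feature."""
        return set(self._facts.get((entity, feature), set()))

    def holds(self, entity, feature, value):
        """True if the exact fact was recorded."""
        return value in self._facts.get((entity, feature), set())

    def entities_with(self, feature, value):
        """Sorted entities for which the given feature has the given value."""
        return sorted(
            entity
            for (entity, feat), vals in self._facts.items()
            if feat == feature and value in vals
        )


truth = TruthSystem(TRIPLES)
print(truth.entities_with("isA", "philosopher"))
print(truth.holds("plato", "knew", "socrates"))
```

Keeping the facts as explicit triples means a relationship such as "knew" stays distinct from class membership such as "isA", which a single similarity score cannot express.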
Modeling Knowledge - Similarity
Socrates != Plato
- Related, but not the same
- One is not a subset of the other
- Found in many of the same documents
- Found in many of the same contexts
- This is where automatic similarity methods fall down a bit.
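The point can be illustrated with a toy co-occurrence measure. In this sketch the document ids, the `SAME_AS` facts, the alternate spelling "platon", and the 0.6 threshold are all hypothetical. Jaccard overlap is used only as a stand-in for whatever automatic similarity method is in play.

```python
def jaccard(a, b):
    """Overlap of two document-id sets, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical postings: ids of the documents that mention each entity.
DOCS = {
    "plato": {1, 2, 3, 4, 5},
    "socrates": {2, 3, 4, 5, 6},
    "platon": {1, 2, 3, 4, 5},
    "zeno": {7, 8},
}

# Hypothetical curated facts: pairs of names that denote the same entity.
# Each pair is stored in sorted order.
SAME_AS = {("plato", "platon")}


def relation(x, y, threshold=0.6):
    """Classify a pair as 'same', 'related', or 'unrelated'."""
    if tuple(sorted((x, y))) in SAME_AS:
        return "same"
    if jaccard(DOCS[x], DOCS[y]) >= threshold:
        return "related"
    return "unrelated"


print(relation("plato", "socrates"))
print(relation("platon", "plato"))
print(relation("plato", "zeno"))
```

Co-occurrence alone scores "plato" and "socrates" as highly similar. Only the curated facts can say that they are related but not interchangeable.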
Modeling Knowledge - Ontologies
Entities and ontologies can work together...
- when building ontologies there are different types of relationships.
- word2vec / phrase2vec and LSA cannot be used by themselves.
- ontologies can be pruned and reshaped by supervised learning.
- ontologies can be reshaped by feature-systems (truth systems).
- most useful ontologies are modeled for a specific feature (product titles).
- query classifier can choose between similarity features / models.
Modeling Knowledge
Corpus Domain Model
Modeling Knowledge
We can’t simply rely on our corpus to provide us with the information necessary to model
our domain. We must use auxiliary data sources.
Fortunately there are many open data sources in the world that we can use to augment
our understanding of our corpus.
Modeling Knowledge
[Diagram: the Domain Model drawing on the Internet: Wikipedia, StackExchange, Amazon, Merchant API]
Modeling Knowledge - Entities
In the previous slides we saw entity mapping and grading of a job-search domain model. This
was accomplished by building candidate phrases and then pruning them using an SVM trained
on features from a known-good data source in which phrases and topics were already labeled.
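The candidate-generation step can be sketched as follows. This is a minimal illustration, not the pipeline shown in the slides: the stopword list, the sample job titles, and the `candidate_phrases` helper are all hypothetical. A plain document-frequency threshold stands in for the SVM-based pruning described above.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real one would be domain specific.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def candidate_phrases(docs, max_len=3, min_doc_freq=2):
    """Return {phrase: document frequency} for n-grams of 2..max_len tokens.

    N-grams that start or end with a stopword are skipped, and phrases seen
    in fewer than min_doc_freq documents are dropped. The threshold is a
    stand-in for a trained classifier.
    """
    doc_freq = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        seen = set()  # count each phrase at most once per document
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                seen.add(" ".join(gram))
        doc_freq.update(seen)
    return {p: c for p, c in doc_freq.items() if c >= min_doc_freq}


# Hypothetical job titles standing in for a labeled corpus.
TITLES = [
    "Senior Java Developer",
    "Java Developer in Austin",
    "Registered Nurse",
    "Registered Nurse for Night Shift",
]
candidates = candidate_phrases(TITLES)
print(candidates)
```

In practice the surviving candidates would be handed to a trained classifier, with user behavior providing further pruning.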
Also shown was a query classifier that takes a lazy or poorly constructed query, groups the
components of the query logically and expands part of the query based on information it knows
about the index and availability and relatedness of terms.
A model to classify queries can be built by understanding the relationship between search
entities, and the entities and information contained within a document.
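A very small version of the grouping step is a greedy longest-match pass over the query against a dictionary of known entities. The dictionary, the type labels, and the `segment_query` helper below are hypothetical. A production classifier would also use index statistics and behavior data, as described above.

```python
# Hypothetical entity dictionary: phrase -> entity type.
ENTITIES = {
    "registered nurse": "title",
    "nurse": "title",
    "austin": "location",
    "night shift": "concept",
}


def segment_query(query, entities=ENTITIES, max_len=3):
    """Group query tokens into the longest known entities, left to right.

    Returns a list of (phrase, type) pairs. Tokens that match no entity
    are returned one at a time with the type 'term'.
    """
    tokens = query.lower().split()
    segments = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in entities:
                segments.append((phrase, entities[phrase]))
                i += n
                break
        else:
            # No window matched: keep the token as a plain term.
            segments.append((tokens[i], "term"))
            i += 1
    return segments


print(segment_query("Registered Nurse Austin night shift jobs"))
```

Once the query is segmented, each typed segment can be routed to a different field, boost, or expansion rule rather than being treated as a bag of terms.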
SHReC is a Java package implementing a hierarchical document clustering algorithm based on a
statistical co-occurrence measure called subsumption.
The algorithm is particularly suited to the problem of on-line "search results" clustering, requiring little
amounts of text data. - http://shrec.sourceforge.net/
Search Action Document
SHReC along with an entity model can be used to prune, grade, and reorganize an ontology to better
understand the types and accuracy of relationships. Algorithms used to cluster behavior with search
terms are invaluable in modeling search intent and rewriting search queries.
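To make the subsumption idea concrete, here is a minimal sketch of one common formulation: term x subsumes term y when most documents containing y also contain x, but not the reverse. The 0.8 threshold, the postings, and the helper names are assumptions for illustration, and SHReC's own implementation may differ.

```python
def subsumes(docs_x, docs_y, threshold=0.8):
    """True if x subsumes y: P(x|y) >= threshold and P(y|x) < 1."""
    docs_x, docs_y = set(docs_x), set(docs_y)
    if not docs_x or not docs_y:
        return False
    both = len(docs_x & docs_y)
    p_x_given_y = both / len(docs_y)  # share of y's documents that contain x
    p_y_given_x = both / len(docs_x)  # share of x's documents that contain y
    return p_x_given_y >= threshold and p_y_given_x < 1.0


# Hypothetical postings: ids of the documents that contain each term.
POSTINGS = {
    "philosopher": {1, 2, 3, 4, 5, 6},
    "plato": {1, 2, 3},
    "socrates": {3, 4, 5},
}


def hierarchy(postings):
    """Sorted (parent, child) pairs where the parent subsumes the child."""
    return sorted(
        (x, y)
        for x in postings
        for y in postings
        if x != y and subsumes(postings[x], postings[y])
    )


print(hierarchy(POSTINGS))
```

With these postings, "philosopher" becomes the parent of both names, while neither name subsumes the other.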
How best to combine phrase boosting,
multi-term synonyms, term position
(proximity), and performance is a
frequent question within the
community.
Exact Phrase Matches → PhraseQuery / SpanQuery
Proximity of Terms → SpanQuery
Related Phrases → Payloads / Index Time Synonyms
Currently in Solr there is no built-in way to represent related entities efficiently. Query rewriting or
expansion can be performed at query time, but not all relationships can be modeled at query time
due to the complexity of the query.
Indexing different classes of synonyms is an option, as is using payloads to
assign relatedness scores to a given entity.
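As a rough sketch of the payload idea, related entities can be written into a field as `term|score` tokens, on the assumption that the field's analyzer is configured with a delimited payload filter that uses `|` as the delimiter and decodes the score as a float. The relatedness scores, the underscore convention for multi-word entities, and the `payload_field_value` helper are all hypothetical.

```python
# Hypothetical relatedness scores between an entity and its neighbours.
RELATED = {
    "plato": {"socrates": 0.6, "aristotle": 0.5},
}


def payload_field_value(entity, related=RELATED, delimiter="|"):
    """Build a whitespace-separated string of term|score tokens.

    The entity itself gets a score of 1.0 and related entities follow in
    descending score order. Spaces inside an entity are replaced with
    underscores so that a whitespace tokenizer keeps it as one token. That
    convention is an assumption, not a Solr requirement.
    """
    tokens = [f"{entity.replace(' ', '_')}{delimiter}1.0"]
    neighbours = sorted(related.get(entity, {}).items(), key=lambda kv: -kv[1])
    for other, score in neighbours:
        tokens.append(f"{other.replace(' ', '_')}{delimiter}{score}")
    return " ".join(tokens)


print(payload_field_value("plato"))
```

At query time, a payload-aware scoring function can then read the stored score instead of treating every related entity as an equal match.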
All index-side synonym solutions are quite custom and are not easy to quickly implement.
Better tools are needed to correctly model graphs of terms or entities and to create rules for how and
when to rewrite search queries without using crude rule-based systems.
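Query-time expansion, where it is feasible, can be sketched as building a boosted query string in standard Lucene query syntax (quoted phrases, `^` boosts, `~` slop). The relatedness table, the field name `text`, the 0.55 cut-off, and the helper names are hypothetical. This is exactly the kind of hand-rolled rewriting that the text argues needs better tooling.

```python
# Hypothetical relatedness scores between an entity and its neighbours.
RELATED = {"plato": {"socrates": 0.6, "aristotle": 0.5}}


def quote(phrase):
    """Wrap a phrase in double quotes, escaping backslashes and quotes."""
    return '"' + phrase.replace("\\", "\\\\").replace('"', '\\"') + '"'


def expand(field, entity, related=RELATED, min_score=0.55):
    """OR the entity with related entities, boosted by their relatedness."""
    clauses = [f"{field}:{quote(entity)}"]
    neighbours = sorted(related.get(entity, {}).items(), key=lambda kv: -kv[1])
    for other, score in neighbours:
        if score >= min_score:  # drop weak relationships
            clauses.append(f"{field}:{quote(other)}^{score}")
    return "(" + " OR ".join(clauses) + ")"


def near(field, phrase, slop):
    """Proximity query: the phrase terms within `slop` positions."""
    return f"{field}:{quote(phrase)}~{slop}"


print(expand("text", "plato"))
print(near("text", "registered nurse", 2))
```

The cut-off keeps the expanded query small. Once many entities each bring many neighbours, the query grows quickly, which is the complexity limit noted above.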
Conclusion
- Modeling the world through language is hard.
- Modeling phrases and entities makes life a little easier.
- Phrases form the basis of relationships.
- Accuracy should be proportional to confidence.

The need for sophistication in modern search engine implementations

  • 1. A Case for Sophistication The modern requirements of search (Solr Focused)
  • 2. Let’s get philosophical – for a moment.
  • 3. George Gilder published a book in 1990 that made some fairly accurate predictions about the future of computing.
  • 5. If wealth is knowledge, then knowledge about our domain, and the knowledge to model it accurately, could be said to be “value”. “Value” is the proposition that drives users to engage with search.
  • 6. In computer graphics, 3D models are simplified for real-time applications (video games). Fidelity is preserved by applying a high-fidelity proxy to the lower-fidelity “real-time” representation. This process is called “baking”.
  • 7. In machine learning, when we ‘train’ a model, we are ‘baking’ knowledge into a more efficient representation. The same is true for how we might enhance searches by using external datasets, query statistics, LTR, etc. Modeling a high-fidelity representation of data into a real-time, more efficient form is key to climbing the ladder of search sophistication.
  • 8. Representing domain knowledge within our search platform so that it provides value to our users is how we achieve sophistication. This is perhaps the greatest challenge in building search products.
  • 9. Our premise: Intent IS accuracy, recall IS relevancy. This may be controversial; recall vs accuracy is the wrong juxtaposition.
  • 10. Our premise: Perhaps the best way this relationship can be described is:
      - The fidelity of a domain model impacts recall
      - Accuracy is linked to our domain model
      - Relevancy is linked to accuracy
      - Accuracy is best modeled by understanding intent
      - Restrictive queries shouldn’t be presumed to be accurate
      Accuracy exists independent of the percent of documents matched.
  • 11. Our premise: If accuracy is the ultimate goal, and ‘recall’ is a part of accuracy, how do we go about achieving this?
  • 12. Our premise: Intent: Disambiguation, Location, Title (Known Item), Category, Conceptual. Modeling intent allows you to have a conversation with your user.
  • 13. Our premise: Recall: entities, synonymy, hierarchy, ontology, conceptual proximity. More sophisticated and higher-resolution primitives within the index offer the opportunity for recall to be more useful and more accurate.
  • 14. Modeling Knowledge: Let’s return to discussing sophistication. Earlier we made the claim that knowledge is what provides value. We also said that modeling knowledge is difficult. Implementing maturity in our search platform is what allows us to model our domain knowledge.
  • 16. Modeling Knowledge: Observation: it’s really hard for most organizations to climb the sophistication ladder that was shown in the previous slide.
  • 17. Out of the box: Scorer (default similarity), Query Handler (Edismax), Import Handlers, Analyzers / TokenFilters, Boost Functions
  • 18. What we need: Query Classifiers, ML Models, Behavior Sampling / Ingestion, Identity Awareness, Secondary Data Sources (data connectors), Alternative forms of storage (inverted index), Integrations (Spark, Airflow, etc.), Collections as “Containers” for behavior
  • 19. Modeling Knowledge: When we model our domain we want to model “things” that we can call “entities”. Modeling entities in any domain can be extremely valuable.
  • 25. Modeling Knowledge - Entities
      1.) disambiguation for free
      2.) fairly easy to generate candidates for any domain
      3.) fairly well researched area of ML
      4.) helps in the modeling of “conceptual” synonyms
      5.) must be pruned by user feedback / behavior
      6.) groundwork for higher-level, more sophisticated features
  • 28. Modeling Knowledge - Entities: Ok, but why? In the previous slide we saw that 40% of Target Corporation’s searches are low-information, and Target doesn’t know what they mean. Without modeling your corpus (the content you are searching) you won’t be able to reason about the behavior or relationship between searches, actions, and ultimately intent. It is extremely common for a good portion of searches (half) to not provide the information necessary to give relevant term-based search results. This is at the core of the case for sophistication. Term search simply can’t provide useful results for a large number of the searches that your users are going to perform.
  • 30. Modeling Knowledge – Truth Systems
      entity   | feature | value
      plato    | isA     | philosopher
      socrates | isA     | philosopher
      plato    | knew    | socrates
      socrates | knew    | plato
      plato    | isA     | historical-figure
      socrates | isA     | historical-figure
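A minimal sketch of how the entity / feature / value rows on this slide might be held in memory. The `TruthSystem` class and its method names are hypothetical, introduced only for illustration; the slides do not say how the truth system is actually stored.

```python
from collections import defaultdict

# (entity, feature, value) facts copied from the slide.
FACTS = [
    ("plato", "isA", "philosopher"),
    ("socrates", "isA", "philosopher"),
    ("plato", "knew", "socrates"),
    ("socrates", "knew", "plato"),
    ("plato", "isA", "historical-figure"),
    ("socrates", "isA", "historical-figure"),
]


class TruthSystem:
    """Hypothetical in-memory store of (entity, feature, value) facts."""

    def __init__(self, facts):
        self._by_entity = defaultdict(set)
        for entity, feature, value in facts:
            self._by_entity[entity].add((feature, value))

    def values(self, entity, feature):
        """Return every value asserted for this entity and feature, sorted."""
        return sorted(v for f, v in self._by_entity[entity] if f == feature)

    def shared(self, a, b):
        """Return the (feature, value) pairs asserted for both entities."""
        return sorted(self._by_entity[a] & self._by_entity[b])


truth = TruthSystem(FACTS)
print(truth.values("plato", "isA"))
print(truth.shared("plato", "socrates"))
```

The shared `isA` facts show why the two entities look alike to a similarity measure, while the differing `knew` facts are what keep them distinct.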
  • 31. Modeling Knowledge - Similarity: Socrates != Plato
      - Related, but not the same
      - One is not a subset of the other
      - Found in many of the same documents
      - Found in many of the same contexts
      - This is where automatic similarity methods fall down a bit.
  • 32. Modeling Knowledge - Ontologies: Entities and ontologies can work together...
      - when building ontologies there are different types of relationships.
      - word2vec / phrase2vec, LSA, cannot be used by themselves.
      - ontologies can be pruned and reshaped by supervised learning.
      - ontologies can be reshaped by feature-systems (truth systems).
      - most useful ontologies are modeled for a specific feature (product titles).
      - query classifier can choose between similarity features / models.
  • 34. Modeling Knowledge: We can’t simply rely on our corpus to provide us with the information necessary to model our domain. We must use auxiliary data sources. Fortunately there are many open data sources in the world that we can use to augment our understanding of our corpus.
  • 35. Modeling Knowledge: The Internet Domain Model: Wikipedia, StackExchange, Amazon Merchant API
  • 36. Modeling Knowledge - Entities: Entities and ontologies can work together...
      - when building ontologies there are different types of relationships.
      - word2vec / phrase2vec cannot be used by themselves
      - ontologies can be pruned and reshaped by supervised learning
      - ontologies can be reshaped by feature-systems (truth systems)
  • 37. Modeling Knowledge - Entities The
  • 38. In the previous slides we saw entity mapping and grading of a job-search domain model. This was accomplished by building candidate phrases and then pruning them using an SVM trained on features from a known-good data source with phrases and topics already labeled. Also shown was a query classifier that takes a lazy or poorly constructed query, groups the components of the query logically, and expands part of the query based on what it knows about the index and the availability and relatedness of terms. A model to classify queries can be built by understanding the relationship between search entities and the entities and information contained within a document.
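The grouping-and-expansion step described on this slide can be sketched as a greedy longest-match lookup against an entity model. This is a simple stand-in, not the SVM-pruned model from the talk; the `KNOWN_ENTITIES` dictionary, its entries, and `classify_query` are all hypothetical.

```python
# Hypothetical entity model: phrase -> (entity type, related phrases).
KNOWN_ENTITIES = {
    "software engineer": ("title", ["software developer"]),
    "java": ("skill", []),
    "new york": ("location", []),
}


def classify_query(query, entities=KNOWN_ENTITIES, max_len=3):
    """Group query tokens into the longest known entity phrases, left to right.

    Tokens that match no entity are kept as plain terms.
    """
    tokens = query.lower().split()
    groups, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shrink.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + size])
            if phrase in entities:
                kind, related = entities[phrase]
                groups.append({"text": phrase, "type": kind, "expand": related})
                i += size
                break
        else:
            # No entity starts at this token; keep it as an ordinary term.
            groups.append({"text": tokens[i], "type": "term", "expand": []})
            i += 1
    return groups


for group in classify_query("java software engineer new york remote"):
    print(group)
```

Each group carries its own expansions, so a downstream query builder can decide per entity whether to rewrite or leave the text alone.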
  • 39. SHReC is a Java package implementing a hierarchical document clustering algorithm based on a statistical co-occurrence measure called subsumption. The algorithm is particularly suited to the problem of on-line "search results" clustering, requiring only small amounts of text data. - http://shrec.sourceforge.net/ (Diagram labels: Search, Action, Document.) SHReC, along with an entity model, can be used to prune, grade, and reorganize an ontology to better understand the types and accuracy of relationships. Algorithms used to cluster behavior with search terms are invaluable in modeling search intent and rewriting search queries.
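A rough sketch of a subsumption test over document co-occurrence. It assumes one common formulation: term x subsumes term y when P(x | y) is at least 0.8 and P(y | x) is below 1. SHReC's own rule may differ, and the toy documents and function names are invented for illustration.

```python
def doc_sets(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            index.setdefault(term, set()).add(doc_id)
    return index


def subsumes(index, parent, child, threshold=0.8):
    """True if `parent` occurs in most documents containing `child`,
    while `child` does not occur in every document containing `parent`."""
    parent_docs = index.get(parent, set())
    child_docs = index.get(child, set())
    if not parent_docs or not child_docs:
        return False
    both = len(parent_docs & child_docs)
    p_parent_given_child = both / len(child_docs)
    p_child_given_parent = both / len(parent_docs)
    return p_parent_given_child >= threshold and p_child_given_parent < 1.0


# Toy corpus: each document is a list of terms.
DOCS = [
    ["engineer", "software", "java"],
    ["engineer", "software", "python"],
    ["engineer", "civil"],
    ["engineer", "software"],
]
INDEX = doc_sets(DOCS)

print(subsumes(INDEX, "engineer", "software"))
print(subsumes(INDEX, "software", "engineer"))
```

Because the test is asymmetric, it yields parent/child edges rather than a flat similarity score, which is what makes it usable for reorganizing an ontology.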
  • 40. How to find the right combination of phrase boosting, multi-term synonyms, term position (proximity), and performance is a frequent question within the community.
  • 41. Mapping needs to Lucene primitives:
      Exact Phrase Matches → PhraseQuery / SpanQuery
      Proximity of Terms → SpanQuery
      Related Phrases → Payloads / Index Time Synonyms
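The slide names Lucene query classes. On the Solr side, one query-time approximation of the first two rows is eDisMax phrase boosting, sketched here as request parameters. The field name `title` and the boost values are placeholders, and this is an illustration rather than the speaker's configuration.

```python
from urllib.parse import urlencode


def phrase_boost_params(user_query, text_field="title", slop=2):
    """Build eDisMax parameters that match on terms but boost phrase and
    near-phrase matches. The field name and boosts are placeholders."""
    return {
        "q": user_query,
        "defType": "edismax",
        "qf": text_field,           # term matches establish recall
        "pf": f"{text_field}^10",   # boost docs where the whole query appears as a phrase
        "pf2": f"{text_field}^5",   # boost docs where adjacent word pairs appear together
        "ps": str(slop),            # phrase slop applied to the pf boost
    }


params = phrase_boost_params("senior java developer")
query_string = urlencode(params)
print(query_string)
```

Boosting with `pf` / `pf2` leaves recall to `qf` and only reorders results, so it is a gentler tool than requiring an exact phrase match.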
  • 42. Currently in Solr there is no built-in way to represent related entities efficiently. Query rewriting or expansion can be performed at query time, but not all relationships can be modeled at query time due to the complexity of the resulting query. Different classifications of synonyms within the index are an option, as is using payloads to assign relatedness scores to a given entity. All index-side synonym solutions are quite custom and are not easy to implement quickly. Better tools are needed to correctly model graphs of terms or entities and to create rules for how and when to rewrite search queries without using crude rule-based systems.
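As a sketch of the payload idea, the helper below renders related entities as `term|score` tokens. It assumes a field analyzed with Solr's delimited payload token filter (float encoder, `|` delimiter); the helper name, the example entities, and the scores are hypothetical.

```python
def payload_field(related, delimiter="|"):
    """Render {term: relatedness} as 'term|score' tokens, strongest first.

    Spaces inside a term are replaced with underscores so that each
    entity stays a single token when split on whitespace.
    """
    tokens = []
    for term, score in sorted(related.items(), key=lambda kv: (-kv[1], kv[0])):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"relatedness for {term!r} must be in [0, 1]")
        tokens.append(f"{term.replace(' ', '_')}{delimiter}{score:.2f}")
    return " ".join(tokens)


# Hypothetical relatedness scores for one job-title entity.
RELATED = {"software engineer": 1.0, "developer": 0.8, "programmer": 0.75}
field_value = payload_field(RELATED)
print(field_value)
```

Storing the score next to the term keeps relatedness in the index, so it can influence scoring without expanding the query itself.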
  • 46. Conclusion
      - Modeling the world through language is hard.
      - Modeling phrases and entities makes life a little easier.
      - Phrases form the basis of relationships.
      - Accuracy should be proportional to confidence.

Editor's Notes

  1. With that said, there is a relationship between recall and accuracy; that’s not up for debate. What is often missed in the discussion is recall and its relationship to intent.
  8. To have “expanded” recall we must model our domain. To do this we need entities and an understanding of the relationships between them.
  10. Intent is domain specific, so we want to find easier ways to model it. When we model intent we want to do all the things you do with search: test it, debug it, update it, and reinforce it with judgements.
  15. Solr and Elasticsearch provide good primitives out of the box. The demands of modern search applications require more layered and sophisticated primitives.
  16. So I’ve talked about modeling entities, but why do we need to do this? Can’t we get most of the way there with traditional search? Hasn’t it worked fine for most people until now?
  17. This is a slide from a Target Corp. presentation. A lot of their searches are long-tail with poor matches. Also, when their model can’t match both words within a given category, they fall back to searching for only the one term that has the most matches. This is a great example of where understanding the relationship between searches and entities can be very important.
  19. Single-term searches are a huge issue in job search, and interpreting what they mean is a challenge. Modeling entities in our domain helps us begin to understand the relationship between term searches and the types of documents viewed. Entity approaches can also help us understand the individual searcher’s affinity within ambiguous contexts.
  20. You can imagine a search system in which we are modeling people. This might be useful for a library or research system.
  21. Which brings us to traditional challenges with similarity and where it can go off the rails.
  22. Ontologies are a huge subject, but really what we are describing is a graph database with edges that are informed based on what we know about a particular entity. One entity may be perfectly related to another. If we were building a library data-system we might have an ontology of historical persons. The ontology might form features to tell us if the person was an inventor or politician.
  24. A simple ontology can be constructed for job titles from query logs and reviewed / pruned by hand. Supervised ML approaches can get you pretty close to this as well.
  25. Conceptual relationships from phrases are easier to model than words or simpler language.
  26. The accuracy or broadness of a search query should be related to how well we understand what’s being searched for. In the absence of high confidence, search should fall back to a default algorithm.
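The fallback described in the last note can be sketched as a query builder whose strictness follows classifier confidence. The `category` field, the 0.7 threshold, and the function name are assumptions made for illustration; `fq`, `mm`, and `defType` are standard Solr request parameters.

```python
def build_query(user_query, intent, confidence, threshold=0.7):
    """Return Solr-style parameters whose strictness follows classifier confidence."""
    params = {"q": user_query, "defType": "edismax"}
    if intent is not None and confidence >= threshold:
        # High confidence: restrict to the predicted category and require all terms.
        params["fq"] = f'category:"{intent}"'
        params["mm"] = "100%"
    # Low confidence: add nothing, so the default query behavior applies.
    return params


confident = build_query("java", "engineering", 0.9)
unsure = build_query("java", "engineering", 0.4)
print(confident)
print(unsure)
```

Keeping the low-confidence path identical to the default query means a weak classifier can never make results narrower than plain term search.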