The Solr (Multi-Terms) Synonyms Maze (Graphs)

© OpenSource Connections, 2018
The Solr Synonyms Maze
Haystack - The Search Relevance Conference
April 10-11, 2018
Bertrand Rigaldies
Search Consultant
OpenSource Connections
brigaldies@o19s.com
Linkedin: bertrandrigaldies
Multi-Terms Of Graphs
Agenda
1. Analysis Chain, Tokens, and Tokens Graph Explained
2. Issue #1: Index-Time “Sausagization”
3. Issue #2: Graphs Interactions
4. Workarounds
5. Recommendations Recap
Tokens, and Tokens Graph
The “analysis chain” applied to content to index, or to a search string, produces a “tokens graph”.
Synonyms:
scifi, sci fi, science fiction
End-user’s search/query:
q=scifi movies
Analysis Chain and Tokens
Separate terms (Tokenization) → tokens
Remove stop words
Lower case
Remove possessive form (‘s)
Protect selected terms against stemming
Stemming
Expand With Synonyms:
scifi, sci fi, science fiction
“Tokens Graph”
Node:  1             2             3             4 (End)
       X --sci-----> X --fi--------------------> X
       X --science---------------> X --fiction-> X
       X --scifi-------------------------------> X
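The graph above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lucene represents token streams internally: the `EDGES` list and the `all_paths` helper are hypothetical names, and the node numbers are copied from the diagram. Walking every path from the first node to the end node recovers the three synonym alternatives.

```python
from collections import defaultdict

# Edges of the token graph for "scifi, sci fi, science fiction":
# (from_node, to_node, term), using the node numbers from the diagram.
EDGES = [
    (1, 2, "sci"),
    (2, 4, "fi"),
    (1, 3, "science"),
    (3, 4, "fiction"),
    (1, 4, "scifi"),
]


def all_paths(edges, start, end):
    """Enumerate every term sequence that leads from start to end."""
    outgoing = defaultdict(list)
    for src, dst, term in edges:
        outgoing[src].append((dst, term))

    def walk(node, terms):
        if node == end:
            yield terms
            return
        for dst, term in outgoing[node]:
            yield from walk(dst, terms + [term])

    return list(walk(start, []))


paths = all_paths(EDGES, 1, 4)
for path in paths:
    print(" ".join(path))
```

Each printed line is one way to read the graph from start to end, which is why a graph-aware query parser can turn it into alternative clauses.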
Powered by Jupyter notebook
Query-Time Synonyms
Synonyms: scifi, sci fi, science fiction
q=sci fi movies
👍
Document: … science fiction …
defType=edismax
*** sow=false *** (Query-time: Split on whitespace)
qf=overview_query_syn
+(DisjunctionMaxQuery(((
(+overview_query_syn:sci +overview_query_syn:fi)
(+overview_query_syn:science
+overview_query_syn:fiction)
overview_query_syn:scifi)))
DisjunctionMaxQuery((overview_query_syn:movies)))
qf=overview_query_syn_auto_phrase
autoGeneratePhraseQueries="true"
+(DisjunctionMaxQuery(((
overview_query_syn_auto_phrase:"sci fi"
overview_query_syn_auto_phrase:"science fiction"
overview_query_syn_auto_phrase:scifi)))
DisjunctionMaxQuery((overview_query_syn_auto_phrase:movies)))
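To reproduce parsed queries like the ones above, the request parameters can be assembled as in the following sketch. The host, port, and the collection name `movies` are assumptions for illustration; `q`, `defType`, `sow`, `qf`, and `debugQuery` are the Solr request parameters used on this slide or needed to see the parsed query in the response.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and collection name; adjust to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/movies/select"


def build_query_url(user_query, field):
    """Build an edismax request URL with split-on-whitespace disabled."""
    params = {
        "q": user_query,
        "defType": "edismax",
        "sow": "false",        # let the analyzer see "sci fi" as one string
        "qf": field,
        "debugQuery": "true",  # ask Solr to return the parsed query
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_query_url("sci fi movies", "overview_query_syn")
print(url)
```

Fetching that URL against a running Solr and reading the debug section of the response should show whether the multi-term synonyms were expanded.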
Query vs Index
Synonyms: scifi, sci fi, science fiction
q=sci fi movies
👍
Unexpected phrase matches!
“scifi fiction movies”
“scifi fi movies”
Unexpected not matching!
“scifi movies”
Query-Time Index-Time
All graph edges are forced to the next node!
Document: … scifi movies …
👍
Issue #1: Index-Time Multi-Terms Synonyms “Sausage”
● First documented on Michael McCandless's blog in 2012:
○ “the indexer completely ignores PositionLength attribute; ... This means the indexer acts as if all arcs always arrive at the very next position”
● Also nicely explained in a 2017 Lucidworks blog post by Steve Rowe.
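A small simulation makes the “sausage” effect concrete. It is a simplified model, not Lucene code: the `sausage` list is a hand-written picture of what ends up in the index for a document containing “scifi movies” once every arc is forced to the next position, and `phrase_matches` is a hypothetical helper that checks for terms at consecutive positions.

```python
# Index positions after "sausagization" of a document containing
# "scifi movies", with the synonyms "scifi, sci fi, science fiction".
# Every synonym term is squeezed into consecutive positions.
sausage = [
    {"sci", "science", "scifi"},  # position 1
    {"fi", "fiction"},            # position 2
    {"movies"},                   # position 3
]


def phrase_matches(positions, phrase):
    """Return True if the phrase terms occur at consecutive positions."""
    terms = phrase.split()
    for start in range(len(positions) - len(terms) + 1):
        if all(term in positions[start + i] for i, term in enumerate(terms)):
            return True
    return False


for phrase in ["scifi fiction movies", "scifi fi movies", "scifi movies"]:
    print(phrase, "->", phrase_matches(sausage, phrase))
```

In this model the two nonsensical phrases match, while the phrase that is literally in the document does not, because “scifi” and “movies” are no longer adjacent.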
Issue #1 Workaround (1/2): Avoid Indexing Multi-Terms Synonyms
Split the synonyms into separate files for single- and multi-terms:
<fieldType name="text_syn_split" …
autoGeneratePhraseQueries="true">
<analyzer type="index">
…
<filter … synonyms="synonyms_single.txt"/>
…
</analyzer>
<analyzer type="query">
…
<filter … synonyms="synonyms_multi.txt"/>
…
</analyzer>
</fieldType>
Issue #1 Workaround (1/2): Configuration
👍
+DisjunctionMaxQuery((((
overview_query_syn_auto_phrase:"sci fi"
overview_query_syn_auto_phrase:"science fiction"
overview_query_syn_auto_phrase:scifi))))
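Splitting an existing synonyms file by hand is error-prone, so a short script can help. The sketch below is one possible approach, not part of the original talk: `split_synonym_rules` is a hypothetical helper that sends a rule to the multi-terms list as soon as any of its terms contains more than one word, and it assumes the usual synonyms file syntax (comma-separated terms, optional `=>`, `#` comments).

```python
def split_synonym_rules(lines):
    """Return (single_term_rules, multi_term_rules) from synonym file lines."""
    single, multi = [], []
    for line in lines:
        rule = line.strip()
        if not rule or rule.startswith("#"):
            continue  # skip blank lines and comments
        # Terms are comma-separated on both sides of an optional "=>".
        terms = [t.strip() for side in rule.split("=>") for t in side.split(",")]
        if any(len(term.split()) > 1 for term in terms):
            multi.append(rule)
        else:
            single.append(rule)
    return single, multi


rules = [
    "# movie genres",
    "scifi, sci fi, science fiction",
    "fiction, fable",
    "",
    "tv => television",
]
single_rules, multi_rules = split_synonym_rules(rules)
print("synonyms_single.txt:", single_rules)
print("synonyms_multi.txt:", multi_rules)
```

The two lists can then be written to `synonyms_single.txt` and `synonyms_multi.txt`, the file names used in the configuration above.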
Issue #1 Workaround (2/2): Semantic Unit Injection
syn1, syn2, … synN => __semantic_unit__
● Query-Time Injection:
Synonyms declaration: scifi, sci fi, science fiction => __science_fiction__
Query sci fi movies is analyzed as __science_fiction__ movies
The parsed query is
+(DisjunctionMaxQuery((overview_syn_sem_unit:__science_fiction__))
DisjunctionMaxQuery((overview_syn_sem_unit:movies)))
● Index-Time Injection:
Document: sci fi movie on the invasion of earth is analyzed as
__science_fiction__ movie on the invasion of earth
ATTENTION: Semantic units must be injected at both query and index times so that any one of the synonyms in the search string matches any one of the synonyms in the index!
Issue #1 Workaround (2/2 cont’d): Graph
q="scifi movies"
q="sci fi movies"
q="science fiction movies"
👍
Document: scifi movies
Document: sci fi movies
Document: science fiction movies
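The effect of injecting the same semantic unit on both sides can be sketched as follows. This is a toy model of the synonym rule `scifi, sci fi, science fiction => __science_fiction__`, not the Solr filter itself: `inject_semantic_units` is a hypothetical helper that does a greedy longest-match replacement over whitespace-separated tokens.

```python
UNIT = "__science_fiction__"
RULES = {"scifi": UNIT, "sci fi": UNIT, "science fiction": UNIT}


def inject_semantic_units(text, rules):
    """Replace each synonym phrase with its semantic unit (longest match first)."""
    tokens = text.lower().split()
    longest = max(len(phrase.split()) for phrase in rules)
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in rules:
                out.append(rules[phrase])
                i += n
                break
        else:
            # No rule starts at this token: keep it unchanged.
            out.append(tokens[i])
            i += 1
    return out


for text in ["scifi movies", "sci fi movies", "science fiction movies"]:
    print(text, "->", inject_semantic_units(text, RULES))
```

Because all three spellings collapse to the same two tokens, any of the three queries matches any of the three documents, including as a phrase.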
Issue #1 Workaround (2b/2): Taxonomies
term1 term2 term => __hyponym_semantic_unit__, hypernym
● Query-Time Conversion:
Synonyms declaration: scifi, sci fi, science fiction => __science_fiction__, __fiction__
Query sci fi movies is analyzed as (__science_fiction__ __fiction__) movies
The parsed query is
+(DisjunctionMaxQuery((Synonym(overview_index_taxonomy:__fiction__
overview_index_taxonomy:__science_fiction__)))
DisjunctionMaxQuery((overview_index_taxonomy:movies)))
● Index-Time Conversion:
Document: sci fi movie on the invasion of earth is analyzed as
(__science_fiction__ __fiction__) movie ...
Document: political fiction movie on the rise of fascism is analyzed as
(__political_fiction__ __fiction__) movie ...
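The taxonomy variant can be modelled the same way, with each rule emitting a hyponym unit and its hypernym stacked at one position. This is a sketch under assumptions: the `political fiction` rule is inferred from the second document example rather than stated on the slide, and `analyze` and `shared_terms` are hypothetical helpers.

```python
SF = ("__science_fiction__", "__fiction__")
PF = ("__political_fiction__", "__fiction__")
TAXONOMY = {
    "scifi": SF,
    "sci fi": SF,
    "science fiction": SF,
    "political fiction": PF,  # assumed rule, implied by the second document
}


def analyze(text, taxonomy):
    """Return one tuple of stacked terms per position."""
    tokens = text.lower().split()
    longest = max(len(phrase.split()) for phrase in taxonomy)
    positions = []
    i = 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in taxonomy:
                positions.append(taxonomy[phrase])
                i += n
                break
        else:
            positions.append((tokens[i],))
            i += 1
    return positions


def shared_terms(query_positions, doc_positions):
    """Terms of the analyzed query that also occur in the analyzed document."""
    doc_terms = {term for pos in doc_positions for term in pos}
    return {term for pos in query_positions for term in pos if term in doc_terms}


query = analyze("sci fi movies", TAXONOMY)
scifi_doc = analyze("sci fi movie on the invasion of earth", TAXONOMY)
political_doc = analyze("political fiction movie on the rise of fascism", TAXONOMY)
print(shared_terms(query, scifi_doc))
print(shared_terms(query, political_doc))
```

The science fiction document shares both units with the query, while the political fiction document shares only the hypernym `__fiction__`.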
Issue #1 Workaround (2b/2 cont’d): Graph
q=sci fi movies
+(
DisjunctionMaxQuery((
Synonym(
overview_index_taxonomy:__fiction__
overview_index_taxonomy:__science_fiction__)
))
DisjunctionMaxQuery((
overview_index_taxonomy:movies
))
)
Document: political fiction movie
Relevancy impact?
DF(__fiction__) >= DF(__science_fiction__)
Then, other movies matching only on __fiction__ have a lower score than the science fiction movies matching on both __fiction__ and __science_fiction__
Issue #2: Intra-Analysis Graphs Interactions
Issues arise when chaining several graph-producing filters, such as SynonymGraphFilter and WordDelimiterGraphFilter, in a field analysis chain.
Not very well documented or explained:
● Solr WordDelimiterGraphFilter documentation: “Note: although this filter produces correct token graphs, it cannot consume an input token graph correctly.” Not mentioned in the documentation of SynonymGraphFilter!?
● Bottom of JIRA ticket LUCENE-6664: “This new syn filter still cannot consume an arbitrary graph.”
● SynonymGraphFilter.java source code: “this cannot consume an incoming graph; results will be undefined.”
Issue #2: Graph-Producing Filters Interaction
● Use Case #1: Several synonyms filters
<analyzer type="query">
…
<filter … synonyms="synonyms_multi.txt"/>
<filter … synonyms="synonyms_multi_2.txt"/>
…
</analyzer>
synonyms_multi.txt:
scifi, sci fi, science fiction
synonyms_multi_2.txt:
fiction, story telling
+DisjunctionMaxQuery((((
(+overview_syn_split:sci +overview_syn_split:fi +overview_syn_split:telling)
(+overview_syn_split:science +overview_syn_split:fiction)
(+overview_syn_split:science +overview_syn_split:story +overview_syn_split:telling)
(+overview_syn_split:scifi +overview_syn_split:telling)
))))
👍
Terms overlap!
Issue #2: Graph-Producing Filters Interaction
● Use Case #1 Workaround:
✅ Do not chain several graph-producing synonyms filters: gather all multi-terms synonyms in a single file.
✅ Be careful with any combos of synonyms files:
<filter … synonyms="synonyms_single.txt"/>
<filter … synonyms="synonyms_multi.txt"/>
fiction, fable
scifi, sci fi, science fiction
Avoid terms overlap!
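Overlaps between synonyms files are easy to miss by eye, so a quick check can be scripted. The sketch below is one hypothetical way to do it: `terms_of` collects every individual word used in a list of rules, and `overlapping_terms` intersects two such sets. It assumes the plain comma-separated synonyms syntax with an optional `=>`.

```python
def terms_of(rules):
    """Collect every individual word used anywhere in the given rules."""
    words = set()
    for rule in rules:
        for side in rule.split("=>"):
            for phrase in side.split(","):
                words.update(phrase.lower().split())
    return words


def overlapping_terms(rules_a, rules_b):
    """Words that appear in both rule sets and may cause graph overlap."""
    return terms_of(rules_a) & terms_of(rules_b)


multi = ["scifi, sci fi, science fiction"]
multi_2 = ["fiction, story telling"]
single = ["fiction, fable"]

print(overlapping_terms(multi, multi_2))
print(overlapping_terms(multi, single))
```

Both checks report `fiction`, the term that the two use cases on these slides have in common.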
Issue #2: Graph-Producing Filters Interaction
● Use Case #2: Word Delimiter + Synonyms Combo
<analyzer type="query">
…
<filter class="solr.WordDelimiterGraphFilterFactory"
protected="content_wdf_protected_terms.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="1"
splitOnCaseChange="1"
stemEnglishPossessive="1"
preserveOriginal="1"/>
…
<filter … synonyms="synonyms_multi.txt"/>
…
</analyzer>
scifi, sci fi, science fiction
Sci-fi →
1 2 3
--sci-fi------------>
--scifi------------->
--sci------>--fi---->
Issue #2: Graph-Producing Filters Interaction
Use Case #2 (Cont’d)
+DisjunctionMaxQuery(((((+overview_wdf_query_syn:sci +overview_wdf_query_syn:fi +overview_wdf_query_syn:sci +overview_wdf_query_syn:fi)
(+overview_wdf_query_syn:sci +overview_wdf_query_syn:fi +overview_wdf_query_syn:science +overview_wdf_query_syn:fiction)
(+overview_wdf_query_syn:sci +overview_wdf_query_syn:fi +overview_wdf_query_syn:scifi) (+overview_wdf_query_syn:science
+overview_wdf_query_syn:fiction +overview_wdf_query_syn:sci +overview_wdf_query_syn:fi) (+overview_wdf_query_syn:science
+overview_wdf_query_syn:fiction +overview_wdf_query_syn:science +overview_wdf_query_syn:fiction) (+overview_wdf_query_syn:science
+overview_wdf_query_syn:fiction +overview_wdf_query_syn:scifi) (+overview_wdf_query_syn:science +overview_wdf_query_syn:sci-fi
+overview_wdf_query_syn:fi) (+overview_wdf_query_syn:scifi +overview_wdf_query_syn:sci +overview_wdf_query_syn:fi)
[Figure: Word Delimiter-produced graph, then Synonym-produced graph]
q=sci-fi
👍
Issue #2: Graph-Producing Filters Interaction
● Use Case #2 Workaround: Configure the Word Delimiter Graph filter to NOT produce a graph:
<analyzer type="query">
…
<filter class="solr.WordDelimiterGraphFilterFactory"
protected="content_wdf_protected_terms.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
splitOnCaseChange="1"
stemEnglishPossessive="1"
preserveOriginal="0"/>
…
<filter … synonyms="synonyms_multi.txt"/>
…
</analyzer>
Sci-fi →
1 2 3
--sci------>--fi---->
Issue #2: Graph-Producing Filters Interaction
Use Case #2 Workaround Graphs
q=sci-fi
+DisjunctionMaxQuery(((((+overview_wdf_no_concat_query_syn:sci
+overview_wdf_no_concat_query_syn:fi) (+overview_wdf_no_concat_query_syn:science
+overview_wdf_no_concat_query_syn:fiction) overview_wdf_no_concat_query_syn:scifi))))
👍👍
What if you need all parts produced by the Word Delimiter, not just the split parts?
For example, let’s say we have 3 documents titled:
1. Lowcost Appliances Repair
2. Low Cost Appliances Parts
3. Affordable Appliances Fixing
End-users search with: q=low-cost, lowcost, or low cost
We have the synonyms: low cost, affordable
Our title analysis chain uses the Whitespace tokenizer (WT), the Word Delimiter Graph filter (WDGF) with split parts only, and a Synonym Graph filter (SGF).
Then, q=lowcost matches document #1 only.
lowcost → [WT] lowcost → [WDGF] lowcost → [SGF] lowcost
q=low cost matches documents #2 and #3 only!
low cost → [WT] low | cost → [WDGF] low | cost → [SGF] (low affordable) | cost
q=low-cost matches documents #2 and #3 only!
low-cost → [WT] low-cost → [WDGF] low | cost → [SGF] (low affordable) | cost
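The mismatch above can be reproduced with a toy model of the chain. It is a simplification, not Solr's analysis: `analyze` lowercases, splits on whitespace, and splits on hyphens (split parts only), `query_alternatives` expands a whole query through the `low cost, affordable` rule, and a document matches when it contains every term of at least one alternative. All helper names are hypothetical.

```python
DOCS = {
    1: "Lowcost Appliances Repair",
    2: "Low Cost Appliances Parts",
    3: "Affordable Appliances Fixing",
}
SYNONYMS = [["low cost", "affordable"]]


def analyze(text):
    """Whitespace tokenizer + lowercase + word delimiter (split parts only)."""
    tokens = []
    for raw in text.lower().split():
        tokens.extend(part for part in raw.split("-") if part)
    return tokens


def query_alternatives(tokens):
    """Expand the query through the synonym rules, one token list per alternative."""
    phrase = " ".join(tokens)
    for group in SYNONYMS:
        if phrase in group:
            return [alternative.split() for alternative in group]
    return [tokens]


def search(query):
    """Return ids of documents containing all terms of some alternative."""
    alternatives = query_alternatives(analyze(query))
    hits = []
    for doc_id, title in DOCS.items():
        doc_terms = set(analyze(title))
        if any(all(term in doc_terms for term in alt) for alt in alternatives):
            hits.append(doc_id)
    return hits


for q in ["lowcost", "low cost", "low-cost"]:
    print(q, "->", search(q))
```

No single query finds all three documents, because `lowcost` never enters the synonym rule and the split parts never produce it.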
Issue #2: Graph-Producing Filters Interaction
Use Case #2 Workaround Variation
● Generally, we like the decompounding done by Word Delimiter.
● So, we’ll let it generate all split and concatenated parts, e.g., for “a-b”:
○ (a ab a-b) | b
● But, for those terms also present in synonyms, we’ll protect them from
decompounding in the Word Delimiter (see the protected parameter),
and do the decompounding in the synonyms file:
○ low-cost, lowcost, low cost, affordable
Issue #2: Graph-Producing Filters Interaction
Use Case #2 Workaround Variation (Cont’d)
Search string   WT           WDGF                   SGF
lowcost         lowcost      lowcost                (lowcost low-cost low affordable) | cost
low cost        low | cost   low | cost             (lowcost low-cost low affordable) | cost
low-cost        low-cost     low-cost (protected)   (lowcost low-cost low affordable) | cost
The “Flatten Graph” Filter
<analyzer type="index">
…
<filter class="solr.WordDelimiterGraphFilterFactory" … />
<filter class="solr.FlattenGraphFilterFactory"/>
…
<filter class="solr.SynonymGraphFilterFactory" … synonyms="synonyms_multi_sem_unit.txt"/>
<filter class="solr.FlattenGraphFilterFactory"/>
<filter class="solr.SynonymGraphFilterFactory" … synonyms="synonyms_single.txt"/>
<filter class="solr.FlattenGraphFilterFactory"/>
…
</analyzer>
The “Flatten Graph” Filter (Cont’d)
Synonyms Expansion Graph Flattening
Matching phrases: “scifi fi”
“scifi fiction”
“sci fiction”
“science fi”
Multi-terms Synonyms: Recommendations
● Split single- and multi-terms synonyms into separate files for configuration flexibility.
○ Index-time single-terms synonyms
○ Index-time semantic units-based synonyms
○ Query-time multi-terms synonyms (sow=false!)
● Watch out for combinations of filters that produce non-flat graphs that overlap for one or more terms:
○ Two or more SynonymGraphFilter
○ WordDelimiterFilter (Split parts only) + SynonymGraphFilter
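These recommendations can be turned into a rough lint over an analyzer definition. The sketch below is a coarse heuristic, not an official Solr tool: `graph_producing_filters` is a hypothetical helper that treats every SynonymGraphFilterFactory as graph-producing, and a WordDelimiterGraphFilterFactory as graph-producing only when one of the catenate or preserveOriginal options is set to "1". It needs a complete, well-formed `<analyzer>` element as input.

```python
import xml.etree.ElementTree as ET

# Options that make the word delimiter emit overlapping (non-flat) tokens.
WDGF_GRAPH_FLAGS = ("catenateWords", "catenateNumbers", "catenateAll", "preserveOriginal")


def produces_graph(filter_element):
    """Heuristic: does this filter likely emit a non-flat token graph?"""
    cls = filter_element.get("class")
    if cls == "solr.SynonymGraphFilterFactory":
        return True
    if cls == "solr.WordDelimiterGraphFilterFactory":
        return any(filter_element.get(flag, "0") == "1" for flag in WDGF_GRAPH_FLAGS)
    return False


def graph_producing_filters(analyzer_xml):
    """List the classes of likely graph-producing filters in an analyzer."""
    analyzer = ET.fromstring(analyzer_xml)
    return [f.get("class") for f in analyzer.findall("filter") if produces_graph(f)]


RISKY = """
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterGraphFilterFactory" catenateWords="1" preserveOriginal="1"/>
  <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_multi.txt"/>
</analyzer>
"""

SAFER = """
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterGraphFilterFactory" catenateWords="0" preserveOriginal="0"/>
  <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_multi.txt"/>
</analyzer>
"""

for name, xml_text in [("risky", RISKY), ("safer", SAFER)]:
    found = graph_producing_filters(xml_text)
    print(name, "->", found, "REVIEW" if len(found) > 1 else "ok")
```

More than one graph-producing filter in a query analyzer is a signal to review the chain, not proof of a problem.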
A Working Configuration
<analyzer type="index">
<charFilter … />
<tokenizer
class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter
class="solr.WordDelimiterGraphFilterFactory"
protected="wdf_protected_terms.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
splitOnCaseChange="1"
stemEnglishPossessive="1"
preserveOriginal="0"/>
<filter class="solr.FlattenGraphFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
<filter
class="solr.EnglishMinimalStemFilterFactory"/>
<filter … synonyms="synonyms_single.txt"/>
<filter … synonyms="synonyms_sem_units.txt"/>
<filter class="solr.FlattenGraphFilterFactory"/>
…
</analyzer>
<analyzer type="query">
<charFilter … />
<tokenizer
class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter
class="solr.WordDelimiterGraphFilterFactory"
protected="wdf_protected_terms.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
splitOnCaseChange="1"
stemEnglishPossessive="1"
preserveOriginal="0"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
<filter
class="solr.EnglishMinimalStemFilterFactory"/>
<filter … synonyms="synonyms_multi.txt"/>
…
</analyzer>
Thank You
Copilot Prompting Toolkit_All Resources.pdfCopilot Prompting Toolkit_All Resources.pdf
Copilot Prompting Toolkit_All Resources.pdf
Riccardo Zamana8 views
MariaDB stored procedures and why they should be improved by Federico Razzoli
MariaDB stored procedures and why they should be improvedMariaDB stored procedures and why they should be improved
MariaDB stored procedures and why they should be improved
.NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra... by Marc Müller
.NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra....NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra...
.NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra...
Marc Müller38 views

The Solr (Multi-Terms) Synonyms Maze (Graphs)

  • 1. The Solr (Multi-Terms) Synonyms Maze (Of Graphs)
       Haystack - The Search Relevance Conference, April 10-11, 2018
       Bertrand Rigaldies, Search Consultant, OpenSource Connections
       brigaldies@o19s.com, LinkedIn: bertrandrigaldies
       © OpenSource Connections, 2018
  • 2. Agenda
       1. Analysis Chain, Tokens, and Tokens Graph Explained
       2. Issue #1: Index-Time “Sausagization”
       3. Issue #2: Graphs Interactions
       4. Workarounds
       5. Recommendations Recap
  • 3. Tokens, and Tokens Graph
       The “analysis chain” of content to index, or of a search string, produces a “tokens graph”.
       Synonyms: scifi, sci fi, science fiction
       End-user’s search/query: q=scifi movies
  • 4. Analysis Chain, and Tokens
       - Separate terms (tokenization) → tokens
       - Remove stop words
       - Lower case
       - Remove possessive form (’s)
       - Protect selected terms against stemming
       - Stemming
       - Expand with synonyms: scifi, sci fi, science fiction
  • 5. “Tokens Graph”
       1            2            3            4 (End)
       X --sci----->X --fi------------------->X
       X --science-------------->X --fiction->X
       X --scifi----------------------------->X
  • 6. Query-Time Synonyms (powered by Jupyter notebook)
       Synonyms: scifi, sci fi, science fiction
       q=sci fi movies 👍
       Document: … science fiction …
       defType=edismax
       sow=false (query-time: split on whitespace)
       qf=overview_query_syn
         +(DisjunctionMaxQuery(((
           (+overview_query_syn:sci +overview_query_syn:fi)
           (+overview_query_syn:science +overview_query_syn:fiction)
           overview_query_syn:scifi)))
         DisjunctionMaxQuery((overview_query_syn:movies)))
       qf=overview_query_syn_auto_phrase, with autoGeneratePhraseQueries="true"
         +(DisjunctionMaxQuery(((
           overview_query_syn_auto_phrase:"sci fi"
           overview_query_syn_auto_phrase:"science fiction"
           overview_query_syn_auto_phrase:scifi)))
         DisjunctionMaxQuery((overview_query_syn_auto_phrase:movies)))
  • 7. Query vs Index (powered by Jupyter notebook)
       Synonyms: scifi, sci fi, science fiction
       Query-Time: q=sci fi movies 👍
       Index-Time: Document: … scifi movies … 👍
         All graph edges are forced to the next node!
         Unexpected phrase matches: “scifi fiction movies”, “scifi fi movies”
         Unexpected non-match: “scifi movies”
  • 8. Issue #1: Index-Time Multi-Terms Synonyms “Sausage”
       ● First documented in a 2012 blog post by Michael McCandless:
         ○ “the indexer completely ignores PositionLength attribute; ... This means the indexer acts as if all arcs always arrive at the very next position”
       ● Also nicely explained in a 2017 LucidWorks blog post by Steve Rowe.
  • 9. Issue #1 Workaround (1/2): Avoid Indexing Multi-Terms Synonyms
  • 10. Issue #1 Workaround (1/2): Configuration
       Split the synonyms into separate files for single- and multi-terms:
         <fieldType name="text_syn_split" … autoGeneratePhraseQueries="true">
           <analyzer type="index">
             …
             <filter … synonyms="synonyms_single.txt"/>
             …
           </analyzer>
           <analyzer type="query">
             …
             <filter … synonyms="synonyms_multi.txt"/>
             …
           </analyzer>
         </fieldType>
       Parsed query 👍
         +DisjunctionMaxQuery((((
           overview_query_syn_auto_phrase:"sci fi"
           overview_query_syn_auto_phrase:"science fiction"
           overview_query_syn_auto_phrase:scifi))))
  • 11. Issue #1 Workaround (2/2): Semantic Unit Injection
       syn1, syn2, … synN => __semantic_unit__
       ● Query-Time Injection:
         Synonyms declaration: scifi, sci fi, science fiction => __science_fiction__
         Query sci fi movies is analyzed as __science_fiction__ movies
         The parsed query is
           +(DisjunctionMaxQuery((overview_syn_sem_unit:__science_fiction__))
           DisjunctionMaxQuery((overview_syn_sem_unit:movies)))
       ● Index-Time Injection:
         Document: sci fi movie on the invasion of earth
         is analyzed as __science_fiction__ movie on the invasion of earth
  • 12. Issue #1 Workaround (2/2 cont’d): Graph
       ATTENTION: Semantic units must be injected at both query and index times so that any one of the synonyms in the search string matches any one of the synonyms in the index!
       Queries: q=”scifi movies”, q=”sci fi movies”, q=”science fiction movies” 👍
       Documents: scifi movies; sci fi movies; science fiction movies
  • 13. Issue #1 Workaround (2b/2): Taxonomies
       term1 term2 term => __hyponym_semantic_unit__, hypernym
       ● Query-Time Conversion:
         Synonyms declaration: scifi, sci fi, science fiction => __science_fiction__, __fiction__
         Query sci fi movies is analyzed as (__science_fiction__ __fiction__) movies
         The parsed query is
           +(DisjunctionMaxQuery((Synonym(overview_index_taxonomy:__fiction__ overview_index_taxonomy:__science_fiction__)))
           DisjunctionMaxQuery((overview_index_taxonomy:movies)))
       ● Index-Time Conversion:
         Document: sci fi movie on the invasion of earth
         is analyzed as (__science_fiction__ __fiction__) movie ...
         Document: political fiction movie on the rise of fascism
         is analyzed as (__political_fiction__ __fiction__) movie ...
  • 14. Issue #1 Workaround (2b/2 cont’d): Graph
       q=sci fi movies
         +(
           DisjunctionMaxQuery((
             Synonym(
               overview_index_taxonomy:__fiction__
               overview_index_taxonomy:__science_fiction__)
           ))
           DisjunctionMaxQuery((
             overview_index_taxonomy:movies
           ))
         )
       Document: political fiction movie
       Relevancy impact? DF(__fiction__) >= DF(__science_fiction__). Then, other movies matching only on __fiction__ have a lower score than the science fiction movies matching on both __fiction__ and __science_fiction__.
  • 15. Issue #2: Intra-Analysis Graphs Interactions
       Issue when chaining several graph-producing filters, such as SynonymGraphFilter and WordDelimiterGraphFilter, in a field analysis chain.
       Not very well documented or explained:
       ● Solr WordDelimiterGraphFilter documentation: “Note: although this filter produces correct token graphs, it cannot consume an input token graph correctly.” Not mentioned in the documentation of SynonymGraphFilter!?
       ● Bottom of JIRA ticket LUCENE-6664: “This new syn filter still cannot consume an arbitrary graph.”
       ● SynonymGraphFilter.java source code: “this cannot consume an incoming graph; results will be undefined.”
  • 16. Issue #2: Graph-Producing Filters Interaction
       ● Use Case #1: Several synonyms filters
         <analyzer type="query">
           …
           <filter … synonyms="synonyms_multi.txt"/>
           <filter … synonyms="synonyms_multi_2.txt"/>
           …
         </analyzer>
         synonyms_multi.txt: scifi, sci fi, science fiction
         synonyms_multi_2.txt: fiction, story telling
         Terms overlap! Parsed query 👍
           +DisjunctionMaxQuery(((((+overview_syn_split:sci +overview_syn_split:fi +overview_syn_split:telling)
             (+overview_syn_split:science +overview_syn_split:fiction)
             (+overview_syn_split:science +overview_syn_split:story +overview_syn_split:telling)
             (+overview_syn_split:scifi +overview_syn_split:telling)))))
  • 17. Issue #2: Graph-Producing Filters Interaction
       ● Use Case #1 Workaround:
         ✅ Do not chain several graph-producing synonyms filters: gather all multi-terms synonyms in a single file.
         ✅ Be careful with any combos of synonyms files. Avoid terms overlap!
           <filter … synonyms="synonyms_single.txt"/>   (contains: fiction, fable)
           <filter … synonyms="synonyms_multi.txt"/>    (contains: scifi, sci fi, science fiction)
  • 18. Issue #2: Graph-Producing Filters Interaction
       ● Use Case #2: Word Delimiter + Synonyms Combo
         <analyzer type="query">
           …
           <filter class="solr.WordDelimiterGraphFilterFactory"
                   protected="content_wdf_protected_terms.txt"
                   generateWordParts="1" generateNumberParts="1"
                   catenateWords="1" catenateNumbers="1" catenateAll="1"
                   splitOnCaseChange="1" stemEnglishPossessive="1"
                   preserveOriginal="1"/>
           …
           <filter … synonyms="synonyms_multi.txt"/>
           …
         </analyzer>
         Synonyms: scifi, sci fi, science fiction
         Sci-fi →
           1           2         3
           --sci-fi------------->
           --scifi-------------->
           --sci------>--fi----->
  • 19. Issue #2: Graph-Producing Filters Interaction, Use Case #2 (Cont’d)
       q=sci-fi 👍
       [Figure: the Word Delimiter-produced graph, followed by the Synonym-produced graph]
         +DisjunctionMaxQuery(((((+overview_wdf_query_syn:sci +overview_wdf_query_syn:fi +overview_wdf_query_syn:sci +overview_wdf_query_syn:fi)
           (+overview_wdf_query_syn:sci +overview_wdf_query_syn:fi +overview_wdf_query_syn:science +overview_wdf_query_syn:fiction)
           (+overview_wdf_query_syn:sci +overview_wdf_query_syn:fi +overview_wdf_query_syn:scifi)
           (+overview_wdf_query_syn:science +overview_wdf_query_syn:fiction +overview_wdf_query_syn:sci +overview_wdf_query_syn:fi)
           (+overview_wdf_query_syn:science +overview_wdf_query_syn:fiction +overview_wdf_query_syn:science +overview_wdf_query_syn:fiction)
           (+overview_wdf_query_syn:science +overview_wdf_query_syn:fiction +overview_wdf_query_syn:scifi)
           (+overview_wdf_query_syn:science +overview_wdf_query_syn:sci-fi +overview_wdf_query_syn:fi)
           (+overview_wdf_query_syn:scifi +overview_wdf_query_syn:sci +overview_wdf_query_syn:fi)
  • 20. Issue #2: Graph-Producing Filters Interaction
       ● Use Case #2 Workaround: Configure the Word Delimiter Graph filter to NOT produce a graph:
         <analyzer type="query">
           …
           <filter class="solr.WordDelimiterGraphFilterFactory"
                   protected="content_wdf_protected_terms.txt"
                   generateWordParts="1" generateNumberParts="1"
                   catenateWords="0" catenateNumbers="0" catenateAll="0"
                   splitOnCaseChange="1" stemEnglishPossessive="1"
                   preserveOriginal="0"/>
           …
           <filter … synonyms="synonyms_multi.txt"/>
           …
         </analyzer>
         Sci-fi →
           1           2         3
           --sci------>--fi----->
  • 21. Issue #2: Graph-Producing Filters Interaction, Use Case #2 Workaround Graphs
       q=sci-fi 👍👍
         +DisjunctionMaxQuery(((((+overview_wdf_no_concat_query_syn:sci +overview_wdf_no_concat_query_syn:fi)
           (+overview_wdf_no_concat_query_syn:science +overview_wdf_no_concat_query_syn:fiction)
           overview_wdf_no_concat_query_syn:scifi))))
  • 22. Issue #2: Graph-Producing Filters Interaction, Use Case #2 Workaround Variation
       What if you need all parts produced by the Word Delimiter, not just the split parts?
       For example, let’s say we have 3 documents titled:
         1. Lowcost Appliances Repair
         2. Low Cost Appliances Parts
         3. Affordable Appliances Fixing
       End-users search with q=low-cost, lowcost, or low cost.
       We have the synonyms: low cost, affordable
       Our title analysis chain uses the White Space tokenizer (WT), the Word Delimiter with split parts only (WDGF), and a Synonyms filter (SGF). Then:
       q=lowcost matches document #1 only.
         lowcost → [WT] lowcost → [WDGF] lowcost → [SGF] lowcost
       q=low cost matches documents #2 and #3 only!
         low cost → [WT] low | cost → [WDGF] low | cost → [SGF] (low affordable) | cost
       q=low-cost matches documents #2 and #3 only!
         low-cost → [WT] low-cost → [WDGF] low | cost → [SGF] (low affordable) | cost
  • 23. Issue #2: Graph-Producing Filters Interaction, Use Case #2 Workaround Variation (Cont’d)
       ● Generally, we like the decompounding done by Word Delimiter.
       ● So, we’ll let it generate all split and concatenated parts, e.g., for “a-b”:
         ○ (a ab a-b) | b
       ● But, for those terms also present in synonyms, we’ll protect them from decompounding in the Word Delimiter (see the protected parameter), and do the decompounding in the synonyms file:
         ○ low-cost, lowcost, low cost, affordable
       Search string    WT            WDGF                    SGF
       lowcost          lowcost       lowcost                 (lowcost low-cost low affordable) | cost
       low cost         low | cost    low | cost              (lowcost low-cost low affordable) | cost
       low-cost         low-cost      low-cost (protected)    (lowcost low-cost low affordable) | cost
  • 24. The “Flatten Graph” Filter
         <analyzer type="index">
           …
           <filter class="solr.WordDelimiterGraphFilterFactory" … />
           <filter class="solr.FlattenGraphFilterFactory"/>
           …
           <filter class="solr.SynonymGraphFilterFactory" … synonyms="synonyms_multi_sem_unit.txt"/>
           <filter class="solr.FlattenGraphFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" … synonyms="synonyms_single.txt"/>
           <filter class="solr.FlattenGraphFilterFactory"/>
           …
         </analyzer>
  • 25. The “Flatten Graph” Filter (Cont’d)
       [Figure: synonyms expansion graph, followed by the graph after flattening]
       Matching phrases after flattening: “scifi fi”, “scifi fiction”, “sci fiction”, “science fi”
  • 26. Multi-terms Synonyms: Recommendations
       ● Split single- and multi-terms synonyms into separate files for configuration flexibility:
         ○ Index-time single-terms synonyms
         ○ Index-time semantic units-based synonyms
         ○ Query-time multi-terms synonyms (sow=false!)
       ● Watch out for combinations of filters that produce non-flat graphs that overlap for one or more terms:
         ○ Two or more SynonymGraphFilter
         ○ WordDelimiterFilter (Split parts only) + SynonymGraphFilter
  • 27. A Working Configuration
         <analyzer type="index">
           <charFilter … />
           <tokenizer class="solr.WhitespaceTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.WordDelimiterGraphFilterFactory"
                   protected="wdf_protected_terms.txt"
                   generateWordParts="1" generateNumberParts="1"
                   catenateWords="0" catenateNumbers="0" catenateAll="0"
                   splitOnCaseChange="1" stemEnglishPossessive="1"
                   preserveOriginal="0"/>
           <filter class="solr.FlattenGraphFilterFactory"/>
           <filter class="solr.ASCIIFoldingFilterFactory"/>
           <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
           <filter class="solr.EnglishMinimalStemFilterFactory"/>
           <filter … synonyms="synonyms_single.txt"/>
           <filter … synonyms="synonyms_sem_units.txt"/>
           <filter class="solr.FlattenGraphFilterFactory"/>
           …
         <analyzer type="query">
           <charFilter … />
           <tokenizer class="solr.WhitespaceTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.WordDelimiterGraphFilterFactory"
                   protected="wdf_protected_terms.txt"
                   generateWordParts="1" generateNumberParts="1"
                   catenateWords="0" catenateNumbers="0" catenateAll="0"
                   splitOnCaseChange="1" stemEnglishPossessive="1"
                   preserveOriginal="0"/>
           <filter class="solr.ASCIIFoldingFilterFactory"/>
           <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
           <filter class="solr.EnglishMinimalStemFilterFactory"/>
           <filter … synonyms="synonyms_multi.txt"/>
           …
         </analyzer>
  • 28. Thank You
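
The index-time “sausage” of slides 7, 8 and 25 can be illustrated with a small toy model. The sketch below does not call Solr or Lucene. The graph is copied by hand from slide 5 (with “movies” appended), and the flattened position slots are an assumption inferred from the matching phrases listed on slide 25. The names GRAPH, SAUSAGE, graph_phrases and sausage_matches are hypothetical.

```python
# Toy model of index-time "sausagization"; nothing here calls Solr or Lucene.

# Token graph for "sci fi movies" after synonym expansion, as drawn on
# slide 5, with "movies" appended: (from_node, to_node, term).
GRAPH = [
    (1, 2, "sci"), (2, 4, "fi"),
    (1, 3, "science"), (3, 4, "fiction"),
    (1, 4, "scifi"),
    (4, 5, "movies"),
]

# Assumed flattened form: every arc is forced to arrive at the very next
# position, so the terms collapse into one slot per position.
SAUSAGE = [{"sci", "science", "scifi"}, {"fi", "fiction"}, {"movies"}]


def graph_phrases(edges, start, end):
    """Return every phrase spelled by a path from start to end."""
    if start == end:
        return [""]
    phrases = []
    for src, dst, term in edges:
        if src == start:
            for rest in graph_phrases(edges, dst, end):
                phrases.append((term + " " + rest).strip())
    return phrases


def sausage_matches(slots, phrase):
    """True if the phrase's terms occupy consecutive position slots."""
    terms = phrase.split()
    for first in range(len(slots) - len(terms) + 1):
        if all(t in slots[first + k] for k, t in enumerate(terms)):
            return True
    return False


if __name__ == "__main__":
    print("Graph phrases:", sorted(graph_phrases(GRAPH, 1, 5)))
    for p in ["scifi fi movies", "scifi fiction movies", "scifi movies"]:
        print(p, "->", sausage_matches(SAUSAGE, p))
```

In this model the true graph only spells the three intended phrases, while the flattened slots accept “scifi fi movies” and “scifi fiction movies” and reject “scifi movies”, which is the behavior reported on slide 7.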

Editor's Notes

  1. Focus on the multi-terms synonyms issues.
  2. Focus on two specific issues and their workarounds.
  3. We’re dealing with graphs! That’s a good thing.
  4. An analysis chain produces a series of tokens...
  5. … in a graph.
  6. A better way to visualize the multi-terms synonyms graph. autoGeneratePhraseQueries=”true” is great!
  7. Oh no, trouble at index-time with multi-terms
  8. Now a new issue! Check out these blogs...
  9. Let’s better understand what the documentation is telling us.
  10. A workaround.
  11. Another workaround.
  12. Graph is good now! It’s flat, hence no sausage will be made.
  13. A variation of the semantic unit injection solution. To improve recall.
  14. Second issue.
  15. When terms overlap in two or more graphs produced by the analysis chain.
  16. Workaround.
  17. Another use case with the Word Delimiter filter.
  18. What a mess!
  19. Workaround.
  20. All is good now.
  21. What does the strange FlattenGraph filter do? Why do we have to use it?
  22. Yep, it’s flattening the graph all right!
  23. Let’s recap with our recommendations.
  24. Start with this, and see if it works for you. If not, call OSC!
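
The semantic unit injection workaround (slides 11 and 12) can also be sketched outside Solr. The code below is a minimal illustration, assuming a plain whitespace tokenizer and a hand-written rule table that stands in for the synonyms declaration scifi, sci fi, science fiction => __science_fiction__. The names RULES and inject_semantic_units are hypothetical, and real Solr analysis involves more steps (stemming, word delimiting, and so on).

```python
# Sketch of semantic unit injection; a stand-in for the synonyms filter,
# not Solr's implementation.

# Hand-written equivalent of:
#   scifi, sci fi, science fiction => __science_fiction__
RULES = {
    ("scifi",): "__science_fiction__",
    ("sci", "fi"): "__science_fiction__",
    ("science", "fiction"): "__science_fiction__",
}


def inject_semantic_units(text, rules=RULES):
    """Replace every synonym variant with its single-token semantic unit."""
    tokens = text.lower().split()
    longest = max(len(key) for key in rules)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "science fiction" wins over
        # any shorter rule starting at the same token.
        for size in range(min(longest, len(tokens) - i), 0, -1):
            unit = rules.get(tuple(tokens[i:i + size]))
            if unit is not None:
                out.append(unit)
                i += size
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    for text in ["scifi movies", "sci fi movies", "science fiction movies"]:
        print(text, "->", inject_semantic_units(text))
```

Because the same function is applied to queries and to documents, all three variants reduce to the same flat token sequence, which is why the injection has to happen at both query and index times.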