The Solr (Multi-Terms) Synonyms Maze (Graphs)
Bertrand Rigaldies
A brief tour of the issues and workarounds related to the use of multi-terms synonyms in Solr.
1.
© OpenSource Connections, 2018
The Solr (Multi-Terms) Synonyms Maze (Of Graphs)
Haystack - The Search Relevance Conference, April 10-11, 2018
Bertrand Rigaldies, Search Consultant, OpenSource Connections
brigaldies@o19s.com
LinkedIn: bertrandrigaldies
2.
Agenda
1. Analysis Chain, Tokens, and Tokens Graph Explained
2. Issue #1: Index-Time “Sausagization”
3. Issue #2: Graphs Interactions
4. Workarounds
5. Recommendations Recap
3.
Tokens, and Tokens Graph
The “analysis chain”, applied to content to index or to a search string, produces a “tokens graph”.
Synonyms: scifi, sci fi, science fiction
End-user’s search/query: q=scifi movies
4.
Analysis Chain, and Tokens
● Separate terms (tokenization) → tokens
● Remove stop words
● Lower case
● Remove possessive form (’s)
● Protect selected terms against stemming
● Stemming
● Expand with synonyms: scifi, sci fi, science fiction
5.
“Tokens Graph”
Nodes: 1, 2, 3, 4 (End)
X --sci------> X --fi--------> X
X --science--> X --fiction---> X
X --scifi--------------------> X
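The graph on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the arc encoding `(term, from_node, to_node)`, the node numbers, and the helper name `paths` are hypothetical and are not Lucene's internal representation. Each path from the start node to the end node is one reading of the search string.

```python
from collections import defaultdict

# Hypothetical encoding of the tokens graph for q=scifi movies, expanded with
# the synonyms "scifi, sci fi, science fiction". Node numbers are illustrative.
ARCS = [
    ("sci", 0, 1), ("fi", 1, 3),
    ("science", 0, 2), ("fiction", 2, 3),
    ("scifi", 0, 3),
    ("movies", 3, 4),
]

def paths(arcs, start, end):
    """Return every start-to-end path through the graph as a phrase."""
    outgoing = defaultdict(list)
    for term, src, dst in arcs:
        outgoing[src].append((term, dst))

    def walk(node):
        if node == end:
            yield []
            return
        for term, dst in outgoing[node]:
            for rest in walk(dst):
                yield [term] + rest

    return [" ".join(p) for p in walk(start)]

PHRASES = paths(ARCS, 0, 4)
print(PHRASES)
```

The three phrases produced are exactly the three synonyms followed by "movies", which is what a correct graph should allow and nothing more.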
6.
Query-Time Synonyms (powered by Jupyter notebook)
Synonyms: scifi, sci fi, science fiction
q=sci fi movies 👍
Document: … science fiction …
defType=edismax
*** sow=false *** (query-time: split on whitespace)

qf=overview_query_syn
+(DisjunctionMaxQuery(((
    (+overview_query_syn:sci +overview_query_syn:fi)
    (+overview_query_syn:science +overview_query_syn:fiction)
    overview_query_syn:scifi)))
  DisjunctionMaxQuery((overview_query_syn:movies)))

qf=overview_query_syn_auto_phrase, with autoGeneratePhraseQueries="true"
+(DisjunctionMaxQuery(((
    overview_query_syn_auto_phrase:"sci fi"
    overview_query_syn_auto_phrase:"science fiction"
    overview_query_syn_auto_phrase:scifi)))
  DisjunctionMaxQuery((overview_query_syn_auto_phrase:movies)))
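To reproduce a parsed query like the ones on this slide, the request parameters can be assembled as below. This is a minimal sketch, not the notebook used in the talk: `debugQuery=true` is added here as an assumption so that Solr returns the parsed query, and the resulting string would be appended to your own collection's select handler URL, which is not shown in the deck.

```python
from urllib.parse import urlencode

# Parameters taken from the slide, plus debugQuery (an addition for illustration)
# so that the response includes the parsed query.
params = {
    "q": "sci fi movies",
    "defType": "edismax",
    "sow": "false",              # do not split on whitespace before analysis
    "qf": "overview_query_syn",
    "debugQuery": "true",
}

QUERY_STRING = urlencode(params)
print(QUERY_STRING)
```

With sow=false, the whole string "sci fi movies" reaches the field's query analyzer at once, which is what gives the synonyms filter the chance to see "sci fi" as a multi-term synonym.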
7.
Query vs Index (powered by Jupyter notebook)
Synonyms: scifi, sci fi, science fiction
q=sci fi movies 👍
Document: … scifi movies … 👍
Query-Time vs. Index-Time: at index time, all graph edges are forced to the next node!
● Unexpected phrase matches: “scifi fiction movies”, “scifi fi movies”
● Unexpected non-match: “scifi movies”
8.
Issue #1: Index-Time Multi-Terms Synonyms “Sausage”
● First documented on Michael McCandless’s blog in 2012:
  ○ “the indexer completely ignores PositionLength attribute; ... This means the indexer acts as if all arcs always arrive at the very next position”
● Also nicely explained in a 2017 Lucidworks blog post by Steve Rowe.
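The effect of ignoring position lengths can be simulated with a toy model. The sketch below is a simplification and not Lucene's actual algorithm: it assumes a hypothetical arc encoding `(term, from_node, to_node)`, places each term at the longest distance of its source node from the start, and then treats every term as one position wide. The function names `longest_depth` and `sausage` are illustrative only.

```python
from itertools import product

# Hypothetical arcs (term, from_node, to_node) for "scifi movies" expanded with
# the synonyms "scifi, sci fi, science fiction". Node numbers are illustrative.
ARCS = [
    ("sci", 0, 1), ("fi", 1, 3),
    ("science", 0, 2), ("fiction", 2, 3),
    ("scifi", 0, 3),
    ("movies", 3, 4),
]

def longest_depth(arcs, start):
    """Longest number of arcs from start to each node (arcs run from lower to higher node IDs)."""
    depth = {start: 0}
    for _term, src, dst in sorted(arcs, key=lambda a: a[1]):
        depth[dst] = max(depth.get(dst, 0), depth[src] + 1)
    return depth

def sausage(arcs, start=0):
    """Drop position lengths: stack every term at the depth of its source node."""
    depth = longest_depth(arcs, start)
    slots = {}
    for term, src, _dst in arcs:
        slots.setdefault(depth[src], []).append(term)
    return [slots[pos] for pos in sorted(slots)]

SLOTS = sausage(ARCS)
INDEXED_PHRASES = {" ".join(p) for p in product(*SLOTS)}
SPURIOUS = sorted(INDEXED_PHRASES - {"sci fi movies", "science fiction movies"})
print(SLOTS)
print(SPURIOUS)
print("scifi movies" in INDEXED_PHRASES)
```

In this model the index accepts phrases such as "scifi fi movies" and "scifi fiction movies", and no longer accepts "scifi movies" as a phrase, which mirrors the unexpected matches and the unexpected non-match reported in the deck.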
9.
Issue #1 Workaround (1/2): Avoid Indexing Multi-Terms Synonyms
10.
Issue #1 Workaround (1/2): Configuration
Split the synonyms into separate files for single- and multi-terms:
<fieldType name="text_syn_split" … autoGeneratePhraseQueries="true">
  <analyzer type="index">
    …
    <filter … synonyms="synonyms_single.txt"/>
    …
  </analyzer>
  <analyzer type="query">
    …
    <filter … synonyms="synonyms_multi.txt"/>
    …
  </analyzer>
</fieldType>
👍
+DisjunctionMaxQuery((((
    overview_query_syn_auto_phrase:"sci fi"
    overview_query_syn_auto_phrase:"science fiction"
    overview_query_syn_auto_phrase:scifi))))
11.
Issue #1 Workaround (2/2): Semantic Unit Injection
syn1, syn2, … synN => __semantic_unit__
● Query-Time Injection:
  Synonyms declaration: scifi, sci fi, science fiction => __science_fiction__
  Query sci fi movies is analyzed as __science_fiction__ movies
  The parsed query is
  +(DisjunctionMaxQuery((overview_syn_sem_unit:__science_fiction__)) DisjunctionMaxQuery((overview_syn_sem_unit:movies)))
● Index-Time Injection:
  Document: sci fi movie on the invasion of earth
  is analyzed as __science_fiction__ movie on the invasion of earth
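The injection described on this slide can be imitated outside Solr to see why it yields a flat token stream. The sketch below is an assumption-laden toy, not the SynonymGraphFilter: the mapping `SEMANTIC_UNITS`, the function `inject_units`, and the greedy longest-match strategy on whitespace tokens are all hypothetical choices made for illustration.

```python
# Hypothetical mapping equivalent to the declaration
# "scifi, sci fi, science fiction => __science_fiction__".
SEMANTIC_UNITS = {
    ("scifi",): "__science_fiction__",
    ("sci", "fi"): "__science_fiction__",
    ("science", "fiction"): "__science_fiction__",
}

def inject_units(text, units=SEMANTIC_UNITS):
    """Replace each synonym with its semantic unit, longest match first."""
    tokens = text.lower().split()
    longest = max(len(key) for key in units)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            unit = units.get(tuple(tokens[i:i + n]))
            if unit:
                out.append(unit)
                i += n
                break
        else:
            # No synonym starts at this token: keep it unchanged.
            out.append(tokens[i])
            i += 1
    return out

QUERY_TOKENS = inject_units("sci fi movies")
DOC_TOKENS = inject_units("sci fi movie on the invasion of earth")
print(QUERY_TOKENS)
print(DOC_TOKENS)
```

Because every synonym collapses to a single token, there is one term per position and nothing left for the indexer to flatten incorrectly.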
12.
Issue #1 Workaround (2/2 cont’d): Graph
ATTENTION: Semantic units must be injected at both query and index times, so that any one of the synonyms in the search string matches any one of the synonyms in the index!
q="scifi movies"
q="sci fi movies"
q="science fiction movies"
👍
Document: scifi movies
Document: sci fi movies
Document: science fiction movies
13.
Issue #1 Workaround (2b/2): Taxonomies
term1 term2 term => __hyponym_semantic_unit__, hypernym
● Query-Time Conversion:
  Synonyms declaration: scifi, sci fi, science fiction => __science_fiction__, __fiction__
  Query sci fi movies is analyzed as (__science_fiction__ __fiction__) movies
  The parsed query is
  +(DisjunctionMaxQuery((Synonym(overview_index_taxonomy:__fiction__ overview_index_taxonomy:__science_fiction__))) DisjunctionMaxQuery((overview_index_taxonomy:movies)))
● Index-Time Conversion:
  Document: sci fi movie on the invasion of earth
  is analyzed as (__science_fiction__ __fiction__) movie ...
  Document: political fiction movie on the rise of fascism
  is analyzed as (__political_fiction__ __fiction__) movie ...
14.
Issue #1 Workaround (2b/2 cont’d): Graph
q=sci fi movies
+(
  DisjunctionMaxQuery((
    Synonym(
      overview_index_taxonomy:__fiction__
      overview_index_taxonomy:__science_fiction__)
  ))
  DisjunctionMaxQuery((
    overview_index_taxonomy:movies
  ))
)
Document: political fiction movie
Relevancy impact?
DF(__fiction__) >= DF(__science_fiction__)
Then, other movies matching only on __fiction__ have a lower score than the science fiction movies matching on both __fiction__ and __science_fiction__.
15.
Issue #2: Intra-Analysis Graphs Interactions
An issue arises when chaining several graph-producing filters, such as SynonymGraphFilter and WordDelimiterGraphFilter, in a field analysis chain.
It is not very well documented or explained:
● Solr WordDelimiterGraphFilter documentation: “Note: although this filter produces correct token graphs, it cannot consume an input token graph correctly.” This is not mentioned in the documentation of SynonymGraphFilter!?
● Bottom of JIRA ticket LUCENE-6664: “This new syn filter still cannot consume an arbitrary graph.”
● SynonymGraphFilter.java source code: “this cannot consume an incoming graph; results will be undefined.”
16.
Issue #2: Graph-Producing Filters Interaction
● Use Case #1: Several synonyms filters
<analyzer type="query">
  …
  <filter … synonyms="synonyms_multi.txt"/>
  <filter … synonyms="synonyms_multi_2.txt"/>
  …
</analyzer>
synonyms_multi.txt: scifi, sci fi, science fiction
synonyms_multi_2.txt: fiction, story telling
Terms overlap!
+DisjunctionMaxQuery(((((+overview_syn_split:sci +overview_syn_split:fi +overview_syn_split:telling)
    (+overview_syn_split:science +overview_syn_split:fiction)
    (+overview_syn_split:science +overview_syn_split:story +overview_syn_split:telling)
    (+overview_syn_split:scifi +overview_syn_split:telling)))))
👍
17.
Issue #2: Graph-Producing Filters Interaction
● Use Case #1 Workaround:
  ✅ Do not chain several graph-producing synonyms filters: gather all multi-terms synonyms in a single file.
  ✅ Be careful with any combination of synonyms files:
     <filter … synonyms="synonyms_single.txt"/>   (fiction, fable)
     <filter … synonyms="synonyms_multi.txt"/>    (scifi, sci fi, science fiction)
     Avoid terms overlap!
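One way to act on the "avoid terms overlap" advice is to check the synonyms files before deploying them. The sketch below is a hypothetical helper, not a Solr tool: `terms_in` and `overlapping_terms` are illustrative names, and the parsing is deliberately simplified (whitespace-split terms, `#` treated as a comment marker anywhere in a line, no lowercasing or stemming).

```python
def terms_in(synonym_lines):
    """Collect every individual term used in a list of synonyms-file lines."""
    terms = set()
    for line in synonym_lines:
        line = line.split("#", 1)[0].strip()   # simplified comment handling
        if not line:
            continue
        for side in line.split("=>"):          # both sides of an explicit mapping
            for entry in side.split(","):
                terms.update(entry.split())
    return terms

def overlapping_terms(lines_a, lines_b):
    """Terms that appear in both synonyms files."""
    return sorted(terms_in(lines_a) & terms_in(lines_b))

# The two files from the slide, inlined as lists of lines.
SINGLE = ["fiction, fable"]
MULTI = ["scifi, sci fi, science fiction"]

OVERLAP = overlapping_terms(SINGLE, MULTI)
print(OVERLAP)
```

For the slide's example the check reports "fiction", the term that both filters would try to expand. In practice the lines would be read from synonyms_single.txt and synonyms_multi.txt, and the comparison would ideally run on terms normalized the same way as the analysis chain does.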
18.
Issue #2: Graph-Producing Filters Interaction
● Use Case #2: Word Delimiter + Synonyms Combo
<analyzer type="query">
  …
  <filter class="solr.WordDelimiterGraphFilterFactory"
          protected="content_wdf_protected_terms.txt"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="1" catenateNumbers="1" catenateAll="1"
          splitOnCaseChange="1" stemEnglishPossessive="1" preserveOriginal="1"/>
  …
  <filter … synonyms="synonyms_multi.txt"/>
  …
</analyzer>
Synonyms: scifi, sci fi, science fiction
Word Delimiter output for Sci-fi (nodes 1, 2, 3):
--sci-fi------------>
--scifi------------->
--sci------>--fi---->
19.
Issue #2: Graph-Producing Filters Interaction
Use Case #2 (Cont’d)
q=sci-fi 👍
Figure: the Word Delimiter-produced graph, followed by the Synonym-produced graph.
+DisjunctionMaxQuery(((((+overview_wdf_query_syn:sci +overview_wdf_query_syn:fi +overview_wdf_query_syn:sci +overview_wdf_query_syn:fi)
    (+overview_wdf_query_syn:sci +overview_wdf_query_syn:fi +overview_wdf_query_syn:science +overview_wdf_query_syn:fiction)
    (+overview_wdf_query_syn:sci +overview_wdf_query_syn:fi +overview_wdf_query_syn:scifi)
    (+overview_wdf_query_syn:science +overview_wdf_query_syn:fiction +overview_wdf_query_syn:sci +overview_wdf_query_syn:fi)
    (+overview_wdf_query_syn:science +overview_wdf_query_syn:fiction +overview_wdf_query_syn:science +overview_wdf_query_syn:fiction)
    (+overview_wdf_query_syn:science +overview_wdf_query_syn:fiction +overview_wdf_query_syn:scifi)
    (+overview_wdf_query_syn:science +overview_wdf_query_syn:sci-fi +overview_wdf_query_syn:fi)
    (+overview_wdf_query_syn:scifi +overview_wdf_query_syn:sci +overview_wdf_query_syn:fi)
20.
Issue #2: Graph-Producing Filters Interaction
● Use Case #2 Workaround: Configure the Word Delimiter Graph filter to NOT produce a graph:
<analyzer type="query">
  …
  <filter class="solr.WordDelimiterGraphFilterFactory"
          protected="content_wdf_protected_terms.txt"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="0" catenateNumbers="0" catenateAll="0"
          splitOnCaseChange="1" stemEnglishPossessive="1" preserveOriginal="0"/>
  …
  <filter … synonyms="synonyms_multi.txt"/>
  …
</analyzer>
Word Delimiter output for Sci-fi (nodes 1, 2, 3):
--sci------>--fi---->
21.
Issue #2: Graph-Producing Filters Interaction
Use Case #2 Workaround Graphs
q=sci-fi
+DisjunctionMaxQuery(((((+overview_wdf_no_concat_query_syn:sci +overview_wdf_no_concat_query_syn:fi)
    (+overview_wdf_no_concat_query_syn:science +overview_wdf_no_concat_query_syn:fiction)
    overview_wdf_no_concat_query_syn:scifi))))
👍👍
22.
Issue #2: Graph-Producing Filters Interaction
Use Case #2 Workaround Variation
What if you need all parts produced by the Word Delimiter, not just the split parts?
For example, let’s say we have 3 documents titled:
1. Lowcost Appliances Repair
2. Low Cost Appliances Parts
3. Affordable Appliances Fixing
End-users search with q=low-cost, lowcost, or low cost.
We have the synonyms: low cost, affordable
Our title analysis chain uses the whitespace tokenizer (WT), the Word Delimiter graph filter (WDGF) with split parts only, and a synonyms graph filter (SGF). Then:
● q=lowcost matches document #1 only.
  lowcost → [WT] lowcost → [WDGF] lowcost → [SGF] lowcost
● q=low cost matches documents #2 and #3 only!
  low cost → [WT] low | cost → [WDGF] low | cost → [SGF] (low affordable) | cost
● q=low-cost matches documents #2 and #3 only!
  low-cost → [WT] low-cost → [WDGF] low | cost → [SGF] (low affordable) | cost
23.
Issue #2: Graph-Producing Filters Interaction
Use Case #2 Workaround Variation (Cont’d)
● Generally, we like the decompounding done by Word Delimiter.
● So, we’ll let it generate all split and concatenated parts, e.g., for “a-b”:
  ○ (a ab a-b) | b
● But, for those terms also present in synonyms, we’ll protect them from decompounding in the Word Delimiter (see the protected parameter), and do the decompounding in the synonyms file:
  ○ low-cost, lowcost, low cost, affordable

Search string   WT           WDGF                   SGF
lowcost         lowcost      lowcost                (lowcost low-cost low affordable) | cost
low cost        low | cost   low | cost             (lowcost low-cost low affordable) | cost
low-cost        low-cost     low-cost (protected)   (lowcost low-cost low affordable) | cost
24.
The “Flatten Graph” Filter
<analyzer type="index">
  …
  <filter class="solr.WordDelimiterGraphFilterFactory" … />
  <filter class="solr.FlattenGraphFilterFactory"/>
  …
  <filter class="solr.SynonymGraphFilterFactory" … synonyms="synonyms_multi_sem_unit.txt"/>
  <filter class="solr.FlattenGraphFilterFactory"/>
  <filter class="solr.SynonymGraphFilterFactory" … synonyms="synonyms_single.txt"/>
  <filter class="solr.FlattenGraphFilterFactory"/>
  …
</analyzer>
25.
The “Flatten Graph” Filter (Cont’d)
Figure: synonyms expansion graph, followed by the flattened graph.
Matching phrases: “scifi fi”, “scifi fiction”, “sci fiction”, “science fi”
26.
Multi-terms Synonyms: Recommendations
● Split single- and multi-terms synonyms into separate files for configuration flexibility:
  ○ Index-time single-terms synonyms
  ○ Index-time semantic units-based synonyms
  ○ Query-time multi-terms synonyms (sow=false!)
● Watch out for combinations of filters that produce non-flat graphs that overlap for one or more terms:
  ○ Two or more SynonymGraphFilter
  ○ WordDelimiterFilter (Split parts only) + SynonymGraphFilter
27.
A Working Configuration
<analyzer type="index">
  <charFilter … />
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.WordDelimiterGraphFilterFactory"
          protected="wdf_protected_terms.txt"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="0" catenateNumbers="0" catenateAll="0"
          splitOnCaseChange="1" stemEnglishPossessive="1" preserveOriginal="0"/>
  <filter class="solr.FlattenGraphFilterFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
  <filter … synonyms="synonyms_single.txt"/>
  <filter … synonyms="synonyms_sem_units.txt"/>
  <filter class="solr.FlattenGraphFilterFactory"/>
  …
</analyzer>
<analyzer type="query">
  <charFilter … />
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.WordDelimiterGraphFilterFactory"
          protected="wdf_protected_terms.txt"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="0" catenateNumbers="0" catenateAll="0"
          splitOnCaseChange="1" stemEnglishPossessive="1" preserveOriginal="0"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
  <filter … synonyms="synonyms_multi.txt"/>
  …
</analyzer>
28.
Thank You
Editor's Notes
Focus on the multi-terms synonyms issues.
Focus on two specific issues and their workarounds.
We’re dealing with graphs! That’s a good thing.
An analysis chain produces a series of tokens...
… in a graph.
A better way to visualize the multi-terms synonyms graph. autoGeneratePhraseQueries=”true” is great!
Oh no, trouble at index time with multi-terms synonyms.
Now a new issue! Check out these blogs...
Let’s better understand what the documentation is telling us.
A workaround.
Another workaround.
Graph is good now! It’s flat, hence no sausage will be made.
A variation of the semantic unit injection solution. To improve recall.
Second issue.
When terms overlap in two or more graphs produced by the analysis chain.
Workaround.
Another use case with the Word Delimiter filter.
What a mess!
Workaround.
All is good now.
What does the strange FlattenGraph filter do? Why do we have to use it?
Yep, it’s flattening the graph all right!
Let’s recap with our recommendations.
Start with this, and see if it works for you. If not, call OSC!