Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Bertrand Rigaldies

Principal, OpenSource Connections and Solr Consultant at OpenSource Connections
May. 17, 2019

Editor's Notes

  1. Good morning everyone. My name is Bertrand Rigaldies. I am a search consultant at OpenSource Connections (OSC), which I joined in early 2017. I have worked primarily with Solr, on a variety of search relevancy issues. Lately, I have been very fortunate to work on a custom query parser for a great client, represented by some of you in the audience. A large part of this talk was inspired by that work. In this talk I would like to share with you my experience with query parsers: Why and when do we need them? How do we write one? What are the different design and implementation options, their pros and cons, and some pitfalls? This talk is definitely more engineering than science, more back-to-the-fundamentals than let’s-re-invent-search. Let’s lift the hood and look at the nuts and bolts of query parsers. Hopefully the talk will give you some ideas you can take back to your jobs. Quick poll of the audience on the topic: Raise your arm if you have no idea, or only a vague idea, of what a query parser is. Raise your arm if you have participated in the development of a custom query parser. Raise your arm if you are a developer and/or you’re comfortable with Java.
  2. Some basics first. Let’s locate the query parser, as an architecture component, in the overall Solr (or ES) architecture. This slide was borrowed and adapted from the OSC Solr training material. Talking points: Where do query parsers belong in the big picture of a search engine? The left side handles document indexing; the right side handles querying. The concentric circles should be read from the center going out. A query goes through the following four concentric circles of processing: (1) normalization of the search terms (tokenization, filters, etc.); (2) matching and ranking, which is the responsibility of query parsers; (3) decoration, such as snippeting, highlighting, term vectors, and spell checking; (4) analytics, such as facets. In this talk we’ll be focusing on the matching and ranking ring.
  3. SLIDE CAPTION ONLY. So, what is the problem? Well, with query parsers, we are staring at one of the core challenges of search engines: how to understand the text that the end-user typed or spoke, and turn it into code that can be executed to search. CLICK NEXT. The first part of the problem is essentially the classic problem of compiling a high-level formalism (e.g., everyday English, or a more formal language of search expressions) to low-level executable code. For the most part, this first issue is well understood from a computer science standpoint. There are several great tools to generate compilers (JavaCC, ANTLR, etc.; note: the Lucene classic search language is implemented in JavaCC). And, as we’ll see, Lucene provides a rich set of search primitives that we can use to create executable search code. CLICK NEXT. The second part of the problem is more challenging, and has to do with “query understanding”: What is the end-user saying, and what is the appropriate executable search construct? The Holy Grail of search is to search for what the end-user means, not what he/she typed. Well, we haven’t invented a compiler that can do that yet! Ha ha. Now, more practically for our application design, we should ask ourselves how end-users will search. That understanding will inform how to parse the text they type or speak. NEXT, NEXT, etc.: so-called “natural” language, à la Google; or more formal languages: Boolean and proximity, like the classic Lucene syntax; more advanced than the classic, with operators like as-is, capitalization, clause frequency, etc.; or some kind of hybrid.
  4. So, to wrap up this context-setting slide: At a philosophical level, this PROBLEM may be the FIRST relevancy issue in a search application project: How do we translate the end-user’s high-level search expression into an executable that will most effectively approximate what the end-user is looking for?
  5. ANIMATION! What can we do in Solr (or ES) to address the problem? The good news is that Solr offers powerful out-of-the-box query parsers. The edismax is like a power tool: With it comes great responsibility. Doug wrote three chapters on the subtleties and pros and cons of multi-field queries, and of terms- vs. fields-centric approaches. I spent a year tuning a system using the edismax with many fields. It’s hard work! It also requires a mature relevancy testing infrastructure, by the way, but you knew that. Ask the audience: Who has been using the edismax in their applications? There is a very rich query parser ecosystem in Solr: Solr 7.7 Other Query Parsers.
  6. But, how far can I go with the Solr query parsers? Pretty far actually! For example, for queries that specify some proximity between terms, there is a little-known query parser called the “surround” query parser. Check it out in the Solr documentation. It’s implemented in Lucene by a separate JavaCC grammar (see the package org.apache.lucene.queryparser.surround.parser). Solr demo: http://localhost:8983/solr/demo/select?debugQuery=on&df=title_t&q=%7B!surround%7D%205n(2n(donald%2Ctrump)%2C%20impeached%20OR%20impeachment)&wt=json TODO: Change to an example that is not (too) political, e.g., green legislation: search for the capitalized term “Green” (as in Green New Deal), but not the color “green”, within X positions of the term “legislation”: cap(green) w/5 legislation. Note: the position count in the surround operator is “slop + 1” (number of positions between terms + 1).
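To make the “slop + 1” note concrete, here is a minimal sketch that builds a surround expression from a slop value and percent-encodes it for the q parameter. The class and method names are made up for illustration, and the encoder escapes more characters than the hand-written demo URL does, so the encoded string will not look identical even though it decodes to the same query.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr: builds surround-syntax strings only.
final class SurroundQueryDemo {

    /** Unordered proximity ("n" operator); the numeric prefix is slop + 1. */
    static String near(int slop, String... operands) {
        return (slop + 1) + "n(" + String.join(", ", operands) + ")";
    }

    /** Percent-encodes a query string for use as a URL parameter value. */
    static String encode(String query) {
        // URLEncoder writes spaces as '+'; switch to %20 to match the demo URLs.
        return URLEncoder.encode(query, StandardCharsets.UTF_8).replace("+", "%20");
    }

    public static void main(String[] args) {
        // Terms with at most 4 positions between them.
        String q = "{!surround} " + near(4, "congress", "democrat");
        System.out.println(q);
        System.out.println("q=" + encode(q));
    }
}
```

Keeping the slop-to-prefix arithmetic in one helper avoids off-by-one mistakes when the application thinks in terms of “positions between terms”.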
  7. So, the Solr toolbox provides many query parsers that can be combined in arbitrarily complex compositions. Solr demo: http://localhost:8983/solr/demo/select?debugQuery=on&df=title_t&q=_query_%3A%22%7B!lucene%7D%5C%22green%20deal%5C%22%22%20%0AAND%20%0A_query_%3A%22%7B!surround%7D%205n(congress%2C%20democrat)%22&wt=json
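The q parameter in the demo URL above decodes to two nested queries joined by AND, each wrapped in the _query_:"..." form with its inner double quotes escaped. The sketch below shows that wrapping and escaping step; the helper name is hypothetical and the code only builds the string, it does not talk to Solr.

```java
// Hypothetical helper, not part of Solr: composes nested-query strings only.
final class NestedQueryDemo {

    /** Wraps a sub-query as _query_:"{!parser}..." and escapes backslashes and quotes. */
    static String nested(String parser, String subQuery) {
        String escaped = subQuery.replace("\\", "\\\\").replace("\"", "\\\"");
        return "_query_:\"{!" + parser + "}" + escaped + "\"";
    }

    public static void main(String[] args) {
        String composed = String.join(" AND ",
                nested("lucene", "\"green deal\""),             // phrase query, classic syntax
                nested("surround", "5n(congress, democrat)"));  // proximity, surround syntax
        System.out.println(composed);
    }
}
```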
  8. Example of QPs composition with XML. Show on Postman > Haystack 2019 > XML Query Parser Demo: Postman
  9. Example of QPs composition with the ES-esque JSON Query DSL. Show on Postman > Haystack 2019 > JSON Query Parser Demo.
  10. ANIMATION! NEXT: But there are limitations that could be showstoppers for your functional requirements, e.g., in the surround QP. NEXT: What about operators that do not exist in Lucene or Solr? NEXT: Enter the world of custom query parsers...
  11. Show, explain, and run the code from IntelliJ. We’re going to improve the surround QP in two areas: analyze the search terms, and remove the distance limitation.
  12. Quick query parser anatomy with a couple of slides, and then we’ll do a high-level code walk-through. This is Java code, so hopefully we’re okay with that. My apologies to those in the room who will have a hard time seeing the code. But one takeaway for all should be that there is an easy and convenient set of Solr plugin patterns, as well as Lucene search primitives, that make the creation of custom query parsers very approachable for you all. Show the request handler in IntelliJ.
  13. Note that the query parser does not execute the query. Its sole responsibility is to produce the executable query and return it to Solr. The good news is that, as query parser writers, we don’t have to worry about query execution concerns such as filtering, pagination, highlights, facets, and boosting (more on that later). Let’s walk through the code (next slide).
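As a rough illustration of “parse, don’t execute”, here is a stdlib-only sketch that turns a string such as GREEN w5 Deal into a small parse result. It is not the plugin shown in the talk: the Clause class, the regular expression, and the lowercase-only “analysis” are all assumptions made for this example. In a real Solr plugin this logic would live in a QParser returned by a QParserPlugin, use the field’s analyzer, and return a Lucene Query (for example a SpanNearQuery) for Solr to execute.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy stand-in for a query parser: it only builds a description of the query.
final class ProximityParseSketch {

    /** Hypothetical parse result: two analyzed terms and the operator's distance. */
    static final class Clause {
        final String left;
        final String right;
        final int distance;

        Clause(String left, String right, int distance) {
            this.left = left;
            this.right = right;
            this.distance = distance;
        }

        @Override
        public String toString() {
            return "near(" + left + ", " + right + ", distance=" + distance + ")";
        }
    }

    // Assumed syntax: <term> w<N> <term>
    private static final Pattern INFIX = Pattern.compile("\\s*(\\S+)\\s+w(\\d+)\\s+(\\S+)\\s*");

    /** Stand-in for field analysis: lowercasing only. */
    static String analyze(String term) {
        return term.toLowerCase(Locale.ROOT);
    }

    static Clause parse(String userQuery) {
        Matcher m = INFIX.matcher(userQuery);
        if (!m.matches()) {
            throw new IllegalArgumentException("Expected '<term> w<N> <term>': " + userQuery);
        }
        return new Clause(analyze(m.group(1)), analyze(m.group(3)), Integer.parseInt(m.group(2)));
    }

    public static void main(String[] args) {
        System.out.println(parse("GREEN w5 Deal"));
    }
}
```

How the operator’s number maps to a Lucene slop is left open here on purpose; that mapping is a design decision of the real parser.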
  14. Go to the Solr UI to play with our proximity query parser. FIRST, show the plugin in action. Query with it: {!proximity} GREEN w5 Deal http://localhost:8983/solr/demo/select?debugQuery=on&fl=*&q=%7B!proximity%7D%20GREEN%20w5%20Deal&qf=title_t&wt=json Points to show: analyzed search terms (Donald and IMPEACHED are analyzed to donald and impeached); no limit to the distance (100); the generated Lucene query with debugQuery=on; where the jar is deployed (normally pushed to an artifacts repository, and deployed to your Solr nodes). Show that the plugin is listed: http://localhost:8983/solr/#/demo/plugins?type=queryparser&entry=com.o19s.solr.qparser.ProximityQParserPlugin Query using the request handler /proximity. No match with mm=100 (“democrat” does not match because we have no stemmer): http://localhost:8983/solr/demo/proximity?=&debugQuery=on&fl=*&mm=100&q=green%20new%20w5%20deal%20democrat&qf=title_t&wt=json With mm=50%: http://localhost:8983/solr/demo/proximity?=&debugQuery=on&fl=*&mm=50&q=green%20new%20w5%20deal%20democrat&qf=title_t&wt=json Show boosting, highlights, sort, facets.
  15. SCORING? You get the behavior of the underlying Lucene primitives and of how they are composed. Go to Splainer live: http://splainer.io/#?solr=http:%2F%2Flocalhost:8983%2Fsolr%2Fdemo%2Fproximity%3FdebugQuery%3Don%26q%3DDonald%20w100%20IMPEACHED%26qf%3Dtitle_t&fieldSpec=* Just showing that the scoring is provided by the underlying Lucene primitives. TIDBITS: IDF(phrase) = sum of the IDFs of the phrase’s terms. The span’s “phrase frequency” is 1 / (distance + 1). The span frequency (0.333…) is calculated in the BM25Similarity class (see line 77 in the 7_7 branch): protected float sloppyFreq(int distance) { return 1.0f / (distance + 1); } Can the score be customized? Yes, but that involves peeling the next layer of the Lucene onion and getting into the Weight and Scorer classes. A presentation for another time. Also, show Quepid is fine.
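The two tidbits above can be checked with a few lines of arithmetic. The sketch below copies the sloppyFreq formula quoted in the note and sums per-term IDFs for a phrase. The idf method uses the BM25 form log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) as I understand Lucene’s BM25Similarity to compute it, and the document counts are made-up numbers, not values from the demo index.

```java
// Worked arithmetic only; no Lucene dependency.
final class SpanScoringTidbits {

    /** Same formula as the sloppyFreq method quoted in the note above. */
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    /** BM25-style IDF for one term (assumed formula, see the lead-in). */
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** IDF of a phrase or span: the sum of its terms' IDFs. */
    static double phraseIdf(long docCount, long... docFreqs) {
        double sum = 0;
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, docCount);
        }
        return sum;
    }

    public static void main(String[] args) {
        // A distance of 2 gives the 0.333... span frequency mentioned above.
        for (int distance : new int[] {0, 1, 2, 9}) {
            System.out.printf("distance=%d sloppyFreq=%.3f%n", distance, sloppyFreq(distance));
        }
        // Hypothetical index: 1000 documents, terms found in 12 and 40 of them.
        System.out.printf("phraseIdf=%.3f%n", phraseIdf(1000, 12, 40));
    }
}
```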
  16. Recap: Different approaches to QPs. Notice the location of the Query Parser in the center solution: It is in the application! The application parses the end-user’s string and produces a Solr search using the Perl-like notation, XML, or JSON.
  17. Ease of relevance tuning: Edismax: the good, the bad, and the ugly. QPs composition: the underlying QPs’ knobs and dials. Custom query parser: more software to write: multi-fields, synonyms, compounds, DisMax behavior, terms- vs. fields-centric, tie, mm, etc.
  18. Oftentimes, entities such as dates, numbers, people, places, institutions, etc. must be recognized in a first pass, before producing the parse tree, so that the recognized entities are leaves themselves. In a large SolrCloud cluster, if the entity recognition is an expensive operation, perhaps involving a call to an external service, it is a good idea not to make Solr responsible for entity recognition, and to let the application layer handle it.
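A minimal sketch of such a first pass in the application layer might look like the following. It is purely illustrative: the only “entity” recognized is an ISO-style date matched by a regular expression, and the TYPE:value leaf format is invented for this example. A real system would plug in its own recognizers, possibly an external service, at this point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical application-side first pass; runs before any query parsing.
final class EntityFirstPassSketch {

    // Assumed entity pattern: dates written as YYYY-MM-DD.
    private static final Pattern ISO_DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    /** Tags each whitespace-separated token as a DATE or a plain TERM leaf. */
    static List<String> tagLeaves(String userQuery) {
        List<String> leaves = new ArrayList<>();
        for (String token : userQuery.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input yields no leaves
            }
            String type = ISO_DATE.matcher(token).matches() ? "DATE" : "TERM";
            leaves.add(type + ":" + token);
        }
        return leaves;
    }

    public static void main(String[] args) {
        System.out.println(tagLeaves("green deal 2019-02-07 congress"));
    }
}
```

The tagged leaves can then be handed to whatever builds the parse tree, so a date is treated as one unit rather than re-tokenized.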
  19. [