Does your search application include a custom query syntax with various search operators such as Booleans, proximity, term or phrase frequency, capitalization, quoted text or as-is operator, and other advanced operators? Although most search applications offer a natural language-oriented search box, some advanced applications may also offer a custom query syntax for advanced users or automated tasks. The Lucene "classic" query operators that are supported by the Solr edismax query parser (Boolean, phrase with slop, wildcard, etc.) cover a good number of use cases, but they only get you so far. In this talk, we will explore various strategies to support a custom and advanced query syntax in Solr, covering a spectrum of options from leveraging the out-of-the-box Solr query DSL, to a custom Solr query parser, and hybrid solutions in between. We will identify the options' pros and cons, discuss relevancy considerations, and illustrate the options in Java.
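As a rough illustration of the lightest option on that spectrum (rewriting a custom operator into syntax the stock parser already understands), here is a minimal Python sketch. The `w/N` proximity operator and the function name are assumptions made up for this example and are not taken from the talk. The target syntax, a quoted phrase followed by `~N`, is the standard Lucene phrase-with-slop form.

```python
import re

# Hypothetical custom operator: `a w/N b` means "a within N positions of b".
# This operator is an assumption for illustration, not part of Solr or the talk.
_PROXIMITY = re.compile(r'(\w+)\s+w/(\d+)\s+(\w+)')


def rewrite_proximity(query: str) -> str:
    """Rewrite each `a w/N b` into the Lucene sloppy phrase `"a b"~N`."""
    return _PROXIMITY.sub(
        lambda m: f'"{m.group(1)} {m.group(3)}"~{m.group(2)}',
        query,
    )
```

A regex rewrite like this only handles single-word operands and does not chain (`a w/2 b w/2 c`). Anything richer is likely to need a real grammar, which is where a custom query parser comes in.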
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ... | lucenerevolution
See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011
Attendees will come away from this presentation with a good understanding of, and source
code for, boosting and/or filtering documents by recency, popularity, and personal preferences. My
solution improves upon the common “recipe” based solution for boosting by document age. The
framework also supports boosting documents by a popularity score, which is calculated and
managed outside the index. I will present a few different ways to calculate popularity in a scalable
manner. Lastly, my solution supports the concept of a personal document collection, where each
user is only interested in a subset of the total number of documents in the index.
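For context, the age-based boost that this abstract improves upon is usually written in Solr with the `recip` function query, where `recip(x, m, a, b) = a / (m*x + b)`. The Python sketch below only reproduces that arithmetic to show the shape of the decay curve. It is not the speaker's framework, and the helper names and default constants (the commonly quoted `m = 3.16e-11`, `a = b = 1`) are assumptions for this example.

```python
from datetime import datetime, timedelta, timezone

ONE_YEAR = timedelta(days=365)  # convenience constant for experimenting


def recip(x: float, m: float, a: float, b: float) -> float:
    """Same arithmetic as Solr's recip function query: a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(doc_date: datetime, now: datetime,
                  m: float = 3.16e-11, a: float = 1.0, b: float = 1.0) -> float:
    """Boost in (0, 1] that decays with document age in milliseconds."""
    age_ms = max(0.0, (now - doc_date).total_seconds() * 1000.0)
    return recip(age_ms, m, a, b)
```

With these constants a brand-new document gets a boost of 1.0 and a one-year-old document about 0.5, because 3.16e-11 is roughly one divided by the number of milliseconds in a year.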
Boosting Documents in Solr by Recency, Popularity, and User Preferences | Lucidworks (Archived)
Presentation on how to boost and/or filter documents by recency, popularity, and personal preferences, with access to source code. My solution improves upon the common "recip"-based solution for boosting by document age.
Searching for AI - Leveraging Solr for classic Artificial Intelligence tasks | Alexandre Rafalovitch
Apache Solr has always been built on a strong Information Retrieval/Natural Language Processing foundation. In recent versions, even more Artificial Intelligence features, techniques and integrations have been added to Solr.
This presentation covers some classic AI elements (and hidden gems) that Solr has supported for a long time, as well as the most recent features that are not even fully documented yet.
The presentation was made with references to Solr 7.4.
Query relaxation - A rewriting technique between search and recommendations | René Kriegler
Slides of my 'Haystack - The search relevance conference' talk about query relaxation. The first part gives a brief overview of strategies to help users out of zero-results situations. The second part focuses on query relaxation. I compare several algorithms that try to find the best term to drop from a multi-term zero-results query in order to produce results. The best solution uses a multi-layer neural network with Word2vec as input to find this term.
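To make the idea of dropping one term concrete, here is a minimal Python sketch of a brute-force baseline: try removing each term in turn and keep the variant that returns the most hits. This is an assumption-laden toy and not the Word2vec-based model the talk describes. The in-memory `DOCS` list and the `count_hits` helper are hypothetical stand-ins for a real search engine call.

```python
# Toy "index": each document is a set of tokens (hypothetical data).
DOCS = [
    {"black", "iphone", "case"},
    {"iphone", "case", "leather"},
    {"black", "leather", "wallet"},
]


def count_hits(terms):
    """Number of documents containing all terms (AND semantics)."""
    return sum(1 for doc in DOCS if all(t in doc for t in terms))


def relax_query(terms, count=count_hits):
    """Drop the single term whose removal yields the most hits.

    Returns (relaxed_terms, dropped_term), or None if no single-term
    drop produces any result.
    """
    best = None
    for i, term in enumerate(terms):
        candidate = terms[:i] + terms[i + 1:]
        hits = count(candidate)
        if hits > 0 and (best is None or hits > best[0]):
            best = (hits, candidate, term)
    return None if best is None else (best[1], best[2])
```

This baseline issues one extra query per term, which is the cost that a model predicting the term to drop up front is meant to avoid.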
This talk provides an in-depth overview of the key concepts of Apache Calcite. It explores the Calcite catalog, parsing, validation, and optimization with various planners.
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes | Databricks
Have you ever wondered how to implement your own operator pattern for your service X in Kubernetes? You can learn this in this session and see an example of an open-source project that spawns Apache Spark clusters on Kubernetes and OpenShift following the pattern. You will leave this talk with a better understanding of how the spark-on-k8s native scheduling mechanism can be leveraged and how you can wrap your own service into the operator pattern not only in Go but also in Java. The pod with spark operator and optionally the spark clusters expose the metrics for Prometheus so it makes it eas
Integrating Existing C++ Libraries into PySpark with Esther Kundin | Databricks
Bloomberg’s Machine Learning/Text Analysis team has developed many machine learning libraries for fast real-time sentiment analysis of incoming news stories. These models were developed using smaller training sets, implemented in C++ for minimal latency, and are currently running in production. To facilitate backtesting our production models across our full data set, we needed to be able to parallelize our workloads, while using the actual production code.
We also wanted to integrate the C++ code with PySpark and use it to run our models. In this talk, I will discuss some of the challenges we faced, decisions we made, and other options when dealing with integrating existing C++ code into a Spark system. The techniques we developed have been used successfully by our team multiple times and I am sure others will benefit from the gotchas that we were able to identify.
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ... | Databricks
Dr. Elephant helps improve Spark and Hadoop developer productivity and increase cluster efficiency by making clear recommendations on how to tune workloads and configurations. Originally developed by LinkedIn, Dr. Elephant is now in use at multiple sites.
This session will explore how Dr. Elephant works, the data it collects from Spark environments and the customizable heuristics that generate tuning recommendations. Learn how Dr. Elephant can be used to improve production cluster operations, help developers avoid common issues, and green light applications for use on production clusters.
Building a real time, solr-powered recommendation engine | Trey Grainger
Searching text is what Solr is known for, but did you know that many companies receive an equal or greater business impact through implementing a recommendation engine in addition to their text search capabilities? With a few tweaks, Solr (or Lucene) can also serve as a full-featured recommendation engine. Machine learning libraries like Apache Mahout provide excellent behavior-based, off-line recommendation algorithms, but what if you want more control? This talk will demonstrate how to effectively utilize Solr to perform collaborative filtering (users who liked this also liked…), categorical classification and subsequent hierarchical-based recommendations, as well as related-concept extraction and concept-based recommendations. Sound difficult? It’s not. Come learn step-by-step how to create a powerful real-time recommendation engine using Apache Solr and see some real-world examples of some of these strategies in action.
Like many other messaging systems, Kafka puts a limit on the maximum message size. A user will fail to produce a message that is too large. This limit makes a lot of sense, and people usually send Kafka a reference link that points to a large message stored somewhere else. However, in some scenarios, it would be good to be able to send large messages through Kafka without external storage. At LinkedIn, we have a few use cases that can benefit from such a feature. This talk covers our solution for sending large messages through Kafka without additional storage.
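One generic way to move an oversized payload through a size-limited log without external storage is to split it into segments on the producer side and reassemble them on the consumer side. The Python sketch below shows only that idea. It is not LinkedIn's implementation, it does not call any Kafka client API, and the `Segment` type and function names are assumptions for this example.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    message_id: str   # shared by all segments of one large message
    index: int        # position of this segment, starting at 0
    total: int        # number of segments in the message
    payload: bytes


def split_message(data: bytes, max_segment_size: int, message_id=None):
    """Split data into segments no larger than max_segment_size bytes."""
    if max_segment_size <= 0:
        raise ValueError("max_segment_size must be positive")
    message_id = message_id or uuid.uuid4().hex
    chunks = [data[i:i + max_segment_size]
              for i in range(0, len(data), max_segment_size)] or [b""]
    return [Segment(message_id, i, len(chunks), c) for i, c in enumerate(chunks)]


def reassemble(segments):
    """Rebuild the original bytes; raise if segments are missing or mixed."""
    segs = sorted(segments, key=lambda s: s.index)
    if not segs or len({s.message_id for s in segs}) != 1:
        raise ValueError("segments must belong to exactly one message")
    if [s.index for s in segs] != list(range(segs[0].total)):
        raise ValueError("incomplete or duplicated segments")
    return b"".join(s.payload for s in segs)
```

In a real deployment each segment would be produced as its own record, typically keyed by the message id so all segments land in the same partition. The consumer also needs a policy for buffering and expiring incomplete messages, which this sketch leaves out.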
The core Search frameworks in Liferay 7 have been significantly retooled to benefit not only from Liferay's new modular architecture, but also from one of the most innovative players in the market: Elasticsearch, which replaces Lucene as the default search engine in Portal. This session will cover topics like clustering and scalability, unveil improvements (both Elasticsearch and Solr) like aggregations, filters, geolocation, "more like this" and other new query types, and also hot new features for the Enterprise like out-of-the-box Marvel cluster monitoring and Shield security.
André "Arbo" Oliveira joined Liferay in early 2014 as a senior engineer and leads the Search Infrastructure team. He's been writing code for a living for 22 years, 14 of them as a Java developer and architect. Ever since discovering Elasticsearch, he's vowed never to write another SQL WHERE clause again.
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5 | Keshav Murthy
N1QL gives developers and enterprises an expressive, powerful, and complete language for querying, transforming, and manipulating JSON data. We’ll begin this session with a brief overview of N1QL and then explore some key enhancements we’ve made in the latest versions of Couchbase Server. Couchbase Server 5.0 has language and performance improvements for pagination, index exploitation, integration, index availability, and more. Couchbase Server 5.5 will offer even more language and performance features for N1QL and global secondary indexes (GSI), including ANSI joins, aggregate performance, index partitioning, auditing, and more. We’ll give you an overview of the new features as well as practical use case examples.
Presented on 10/11/12 at the Boston Elasticsearch meetup held at the Microsoft New England Research & Development Center. This talk gave a very high-level overview of Elasticsearch to newcomers and explained why ES is a good fit for Traackr's use case.
N1QL = SQL + JSON. N1QL gives developers and enterprises an expressive, powerful, and complete language for querying, transforming, and manipulating JSON data. We begin with a brief overview. Couchbase 5.0 has language and performance improvements for pagination, index exploitation, integration, and more. We’ll walk through scenarios, features, and best practices.
Back to the future : SQL 92 for Elasticsearch ? @nosqlmatters Dublin 2014 | Lucian Precup
What if we tried to make Elasticsearch SQL 92 compliant (http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt)? You might say this would not be of much use nowadays. Well, we actually tried the exercise and we have some interesting conclusions. While we take Elasticsearch as an example for this "side by side", the issues we are addressing also apply to NoSQL in general. With this unusual exercise, we take the occasion to compare relational databases / SQL with Elasticsearch / NoSQL on all levels: functionality, semantics, performance and user experience.
Multi faceted responsive search, autocomplete, feeds engine & logging | lucenerevolution
Presented by Remi Mikalsen, Search Engineer, The Norwegian Centre for ICT in Education
Learn how utdanning.no leverages open source technologies to deliver a blazing fast multi-faceted responsive search experience and a flexible and efficient feeds engine on top of Solr 3.6. Among the key open source projects that will be covered are Solr, Ajax-Solr, SolrPHPClient, Bootstrap, jQuery and Drupal. Notable highlights are ajaxified pivot facets, multiple-parent hierarchical facets, ajax autocomplete with edge-n-gram and grouping, integrating our search widgets on any external website, custom Solr logging and using Solr to deliver Atom feeds. utdanning.no is a governmental website that collects, normalizes and publishes study information related to secondary school and higher education in Norway. With 1.2 million visitors each year and 12,000 indexed documents, we focus on precise information and a high degree of usability for students, potential students and counselors.
What makes a search engine "intelligent"? In this talk I discuss MarkLogic's full text search features and demonstrate how to enhance search functionality using MarkLogic's new Search API to deliver better, faster results automatically. You will learn how to use Search API to include indexed facets alongside results and perform query expansion to add robust automatic semantic search for known entities and expand thesaurus terms to reduce false negatives.
Building Learning to Rank (LTR) search reranking models using Large Language ... | Sujit Pal
Search engineers have many tools to address relevance. Older tools are typically unsupervised (statistical, rule based) and require large investments in manual tuning effort. Newer ones involve training or fine-tuning machine learning models and vector search, which require large investments in labeling documents with their relevance to queries.
Learning to Rank (LTR) models are in the latter category. However, their popularity has traditionally been limited to domains where user data can be harnessed to generate labels that are cheap and plentiful, such as e-commerce sites. In domains where this is not true, labeling often involves human experts, and results in labels that are neither cheap nor plentiful. This effectively becomes a roadblock to adoption of LTR models in these domains, in spite of their effectiveness in general.
Generative Large Language Models (LLMs) with parameters in the 70B+ range have been found to perform well at tasks that require mimicking human preferences. Labeling query-document pairs with relevance judgements for training LTR models is one such task. Using LLMs for this task opens up the possibility of obtaining a potentially unlimited number of query judgment labels, and makes LTR models a viable approach to improving the site’s search relevancy.
In this presentation, we describe work that was done to train and evaluate four LTR based re-rankers against lexical, vector, and heuristic search baselines. The models were a mix of pointwise, pairwise and listwise, and required different strategies to generate labels for them. All four models outperformed the lexical baseline, and one of the four models outperformed the vector search baseline as well. None of the models beat the heuristics baseline, although two came close – however, it is important to note that the heuristics were built up over months of trial and error and required familiarity with the search domain, whereas the LTR models were built in days and required much less familiarity.
Building a real time big data analytics platform with solr | Trey Grainger
Having “big data” is great, but turning that data into actionable intelligence is where the real value lies. This talk will demonstrate how you can use Solr to build a highly scalable data analytics engine to enable customers to engage in lightning fast, real-time knowledge discovery.
At CareerBuilder, we utilize these techniques to report the supply and demand of the labor force, compensation trends, customer performance metrics, and many live internal platform analytics. You will walk away from this talk with an advanced understanding of faceting, including pivot-faceting, geo/radius faceting, time-series faceting, function faceting, and multi-select faceting. You’ll also get a sneak peek at some new faceting capabilities just wrapping up development including distributed pivot facets and percentile/stats faceting, which will be open-sourced.
The presentation will be a technical tutorial, along with real-world use-cases and data visualizations. After this talk, you'll never see Solr as just a text search engine again.
Reflected intelligence evolving self-learning data systems | Trey Grainger
In this presentation, we’ll talk about evolving self-learning search and recommendation systems which are able to accept user queries, deliver relevance-ranked results, and iteratively learn from the users’ subsequent interactions to continually deliver a more relevant experience. Such a self-learning system leverages reflected intelligence to consistently improve its understanding of the content (documents and queries), the context of specific users, and the collective feedback from all prior user interactions with the system. Through iterative feedback loops, such a system can leverage user interactions to learn the meaning of important phrases and topics within a domain, identify alternate spellings and disambiguate multiple meanings of those phrases, learn the conceptual relationships between phrases, and even learn the relative importance of features to automatically optimize its own ranking algorithms on a per-query, per-category, or per-user/group basis.
5th in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
PGQL: A Query Language for Graphs
Learn how to query graphs using PGQL, an expressive and intuitive graph query language that's a lot like SQL. With PGQL, it's easy to get going writing graph analysis queries to the database in a very short time. Albert and Oskar show what you can do with PGQL, and how to write and execute PGQL code.
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine | Trey Grainger
Search engines frequently miss the mark when it comes to understanding user intent. This talk will describe how to overcome this by leveraging Lucene/Solr to power a knowledge graph that can extract phrases, understand and weight the semantic relationships between those phrases and known entities, and expand the query to include those additional conceptual relationships. For example, if a user types in (Senior Java Developer Portland, OR Hadoop), you or I know that the term “senior” designates an experience level, that “java developer” is a job title related to “software engineering”, that “portland, or” is a city with a specific geographical boundary, and that “hadoop” is a technology related to terms like “hbase”, “hive”, and “map/reduce”. Out of the box, however, most search engines just parse this query as text:((senior AND java AND developer AND portland) OR (hadoop)), which is not at all what the user intended. We will discuss how to train the search engine to parse the query into this intended understanding, and how to reflect this understanding to the end user to provide an insightful, augmented search experience. Topics: Semantic Search, Finite State Transducers, Probabilistic Parsing, Bayes Theorem, Augmented Search, Recommendations, NLP, Knowledge Graphs
Smarter search drives value to your business. Delivering search that matches users to the right content is what you care about. But organizations often get stuck getting there. It turns out that you need quite a number of very different ingredients to deliver tremendous search. It can make your head spin! To help you think through where your team is on its road to smarter search, Pugh introduces the maturity model used by OpenSource Connections and walks you through a very concrete method to inventory needed skills and translate that into search roles for your team. He shows how to measure your capabilities in key areas of search to drive better ROI from search.
The right path to making search relevant - Taxonomy Bootcamp London 2019 | OpenSource Connections
Three aspects of search quality; focusing on relevance; why this is not just a technology problem; measuring search maturity & relevance; open source tools and techniques; Solr and Elasticsearch
Payloads have been a powerful aspect of Lucene for a long time, but have only had limited exposure in Solr. The Tika project has only recently finished integrating the powerful Tesseract OCR library, bringing the prospect of OCR to the masses.
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl | OpenSource Connections
Over the past year, the POLITICO team has developed a recommendation system for our users, which recommends not only news content to read but also news topics to subscribe to. This talk will discuss our development path, including dead-ends and performance trade-offs. In the end, the team produced a system based on search technology (in our case, Elasticsearch) and refined by machine learning techniques to achieve a balance between personalization and serendipity.
With the advent of deep learning and algorithms like word2vec and doc2vec, vector-based representations are increasingly being used in search to represent anything from documents to images and products. However, search engines work with documents made of tokens, not vectors, and are typically not designed for fast vector matching out of the box. In this talk, I will give an overview of how vectors can be derived from documents to produce a semantic representation of a document that can be used to implement semantic / conceptual search without hurting performance. I will then describe a few different techniques for efficiently searching vector-based representations in an inverted index, including LSH, vector quantization and k-means tree, and compare their performance in terms of speed and relevancy. Finally, I will describe how each technique can be implemented efficiently in a Lucene-based search engine such as Solr or Elasticsearch.
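Of the techniques named above, LSH is the easiest to sketch: random-hyperplane hashing turns a dense vector into a bit signature, and chunks of that signature can be indexed as ordinary tokens in an inverted index. The Python below is a minimal sketch of that general idea only. The function names, the band-token format and the parameters are assumptions for this example, not the speaker's implementation.

```python
import random


def make_hyperplanes(dim: int, n_bits: int, seed: int = 0):
    """One random Gaussian hyperplane (normal vector) per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vector, hyperplanes) -> str:
    """Bit i is '1' if the vector lies on the non-negative side of plane i."""
    return "".join(
        "1" if sum(v * h for v, h in zip(vector, plane)) >= 0 else "0"
        for plane in hyperplanes
    )


def band_tokens(signature: str, band_size: int):
    """Split a signature into indexable tokens such as 'b0_1011'."""
    return [
        f"b{i // band_size}_{signature[i:i + band_size]}"
        for i in range(0, len(signature), band_size)
    ]
```

Vectors separated by a small angle tend to agree on most bits, so they share many band tokens. A plain term query on those tokens can then act as a cheap candidate filter before exact similarity is computed on the survivors.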
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger | OpenSource Connections
To optimally interpret most natural language queries, it is necessary to understand the phrases, entities, commands, and relationships represented or implied within the search. Knowledge graphs serve as useful instantiations of ontologies which can help represent this kind of knowledge within a domain.
In this talk, we'll walk through techniques to build knowledge graphs automatically from your own domain-specific content, how you can update and edit the nodes and relationships, and how you can seamlessly integrate them into your search solution for enhanced query interpretation and semantic search. We'll have some fun with some of the more search-centric use cases of knowledge graphs, such as entity extraction, query expansion, disambiguation, and pattern identification within our queries: for example, transforming the query "bbq near haystack" into
{ filter:["doc_type":"restaurant"], "query": { "boost": { "b": "recip(geodist(38.034780,-78.486790),1,1000,1000)", "query": "bbq OR barbeque OR barbecue" } } }
We'll also specifically cover use of the Semantic Knowledge Graph, a particularly interesting knowledge graph implementation available within Apache Solr that can be auto-generated from your own domain-specific content and which provides highly-nuanced, contextual interpretation of all of the terms, phrases and entities within your domain. We'll see a live demo with real world data demonstrating how you can build and apply your own knowledge graphs to power much more relevant query understanding within your search engine.
For e-commerce applications, matching users with the items they want is the name of the game. If they can't find what they want then how can they buy anything?! Typically this functionality is provided through search and browse experiences. Search allows users to type in text and match against the text of the items in the inventory. Browse allows users to select filters and slice-and-dice the inventory down to the subset they are interested in. But with the shift toward mobile devices, no one wants to type anymore - thus browse is becoming dominant in the e-commerce experience.
But there's a problem! What if your inventory is not categorized? Perhaps your inventory is user generated or generated by external providers who don't tag and categorize the inventory. No categories and no tags means no browse experience and missed sales. You could hire an army of taxonomists and curators to tag items - but training and curation will be expensive. You can demand that your providers tag their items and adhere to your taxonomy - but providers will buck this new requirement unless they see obvious and immediate benefit. Worse, providers might use tags to game the system - artificially placing themselves in the wrong category to drive more sales. Worst of all, creating the right taxonomy is hard. You have to structure a taxonomy to realistically represent how your customers think about the inventory.
Eventbrite is investigating a tantalizing alternative: using a combination of customer interactions and machine learning to automatically tag and categorize our inventory. As customers interact with our platform - as they search for events and click on and purchase events that interest them - we implicitly gather information about how our users think about our inventory. Search text effectively acts like a tag, and a click on an event card is a vote that the clicked event is representative of that tag. We are able to use this stream of information as training data for a machine learning classification model; and as we receive new inventory, we can automatically tag it with the text that customers will likely use when searching for it. This makes it possible to better understand our inventory, our supply and demand, and most importantly this allows us to build the browse experience that customers demand.
In this talk I will explain in depth the problem space and Eventbrite's approach in solving the problem. I will describe how we gathered training data from our search and click logs, and how we built and refined the model. I will present the output of the model and discuss both the positive results of our work as well as the work left to be done. Those attending this talk will leave with some new ideas to take back to their own business.
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...OpenSource Connections
Recently Elasticsearch has introduced a number of ways to improve search relevance of your documents based on numeric features. In this talk I will present the newly introduced field types of "rank_feature", "rank_features", "dense_vector", and "sparse_vector" and discuss in what situations and how they can be used to boost scores of your documents. I will also talk about the inner workings of queries based on these fields, and related performance considerations.
Haystack 2019 - Architectural considerations on search relevancy in the conte...OpenSource Connections
With an increasing amount of relevancy factors, relevancy fine-tuning becomes more complex as changing the impact of factors produces increasingly more unintended side effects. In recent years, there has been a lot of discussion about how learning algorithms can replace manual relevancy fine-tuning in order to manage this complexity. However, discussions about the challenge of relevancy should additionally consider architectural aspects. Especially microservice-based architectures provide many ways to encapsulate and to separate complexities of search solutions, which facilitates optimizing the search as well as locating and fixing problems.
Generally, relevancy factors can be assigned to three different groups, each handled at a different stage of the search request processing. The first group contains contextual factors that depend on certain characteristics of a query, such as query-related boosts lifting up top-sellers for queries or category-related boosts to distinguish products from their accessories. Such contextual factors can be handled as a step of the preprocessing of queries. The respective boosting information can simply be appended to the query before it is actually sent to the search engine. Ideally, the normalization of the query is done beforehand.
The second group contains factors that are considered for all queries in more or less the same way, e.g. a ranking function based on keyword occurrences, product topicality or total sales. Factors related to this group can be handled directly by configuring the search engine.
The third group contains situational factors. For instance, a certain product might be a good match for a certain query in general, but due to situational circumstances it should not appear among the top five products (e.g. because it is out of stock). Such situational factors can be handled by re-sorting result sets after they are returned by the search engine.
The handling of the different factors within successive stages of search request processing will be discussed from an architectural perspective. Implications for applying learning algorithms and the implementation of a personalized search will be considered.
Haystack 2019 - Establishing a relevance focused culture in a large organizat...OpenSource Connections
For a relevance engineer one of the most difficult tasks in the tuning process is to convince others in the organization that this is a joint effort. Even the brightest search guru doesn't get very far when working in isolation, so establishing cross-collaboration through the organization is essential. But how to get there?
On top of that, in a large organization a relevance engineer often works on multiple seemingly unrelated search projects. The challenge is not to get drowned in building custom solutions for each project, but to design generic and re-usable strategies which solve many problems at once.
In this session we'll discuss how to build a widely supported basis for search quality improvements in an organization. It is full of practical tips and examples which could help you in establishing a cross-functional culture that is optimal for relevance tuning. It also zooms in on a holistic approach to solving multiple equivalent search issues at once.
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...OpenSource Connections
Relevance metrics like NDCG or ERR require graded judgements to evaluate query relevance performance. But what happens when we don't know what 'good' looks like ahead of time? This talk will look at using click modeling techniques to infer relevance judgements from user interaction logs.
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah ViaOpenSource Connections
The New York Times has had search for a long time but 2018 was the year in which the company engaged with relevance in a deep way. The aim of this talk is to share what we've learned as we've increased our search sophistication and some of the challenges we still face.
Some of the techniques we've adopted in this past year include offline metrics testing, reflective testing, and user engagement metrics. We now have a process in place to quickly get mappings changes out to production. As a team we now also have a vocabulary for talking about relevance and can use it to discuss trade-offs and goals in conjunction with our metrics.
We hope this talk is of use to those who've put off working on search relevance due to fear, uncertainty, or ambivalence. We will talk about how we went from working on everything but search relevance to finally pulling back the curtain on the search system. We hope what we've learned can help others get started.
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...OpenSource Connections
Evaluation of search quality is essential for developing effective rankers. Interleaved comparison methods achieve statistical significance with less data than with traditional A/B testing, meaning tests can be run in shorter timeframes and more sensitive changes to the ranker can be evaluated. In interleaved ranking two result lists are combined in a "fair" manner, such that clicks can be interpreted as unbiased judgments about the relative quality of the two rankers. In this talk we will dive into why interleaving can be a superior online evaluation method, along with how it could be added to your own evaluation toolset.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
1. Custom Solr Query Parser
Design Options, Pros & Cons
Haystack Training & Conference
April 22nd – 25th, 2019 • Charlottesville, VA, USA
Bertrand Rigaldies
Search Consultant
OpenSource Connections
brigaldies@o19s.com
Linkedin: bertrandrigaldies
q=Haystack w5 Rocks!
Op: w (dist=5)
Term: Haystack    Term: Rocks!
SpanNear({'haystack', 'rock'}, 5, true)
2. Haystack 2019, April 24-25
Agenda
- Query parsers’ purpose
- Query parser composition in Solr
- When do you need a custom query parser?
- How to build a custom query parser?
- Pros and cons of various design approaches
- Beyond query parsers
3. Haystack 2019, April 24-25
Search engine big picture
[Diagram labels: Documents, Index, Query, Matches, Search results (ranked, highlighted, faceted)]
Credit: Doug Turnbull, "Think Like a Relevance Engineer" training material, Day #2, Session #1
4. Haystack 2019, April 24-25
What’s The Problem Here?
1. [ Expression → Search Executable ] compilation
2. Query Understanding
3. How do your users search?
○ “Natural” language, as we increasingly do everyday
○ Or, a more formal search language:
■ With operators like boolean and proximity
■ Advanced custom query syntax
○ Or, some kind of hybrid of the above
End-Users Spectrum: Casual, Occasional | Seasoned | Professional Librarian
5. Haystack 2019, April 24-25
What’s The Problem? (again)
Is it the FIRST relevancy issue in a search application project? How do we translate the end-user’s high-level search expression into an executable that will most effectively approximate what the end-user is looking for?
6. Haystack 2019, April 24-25
What Can We Do Out-of-the-box?
● A lot! Solr (ES too) offers powerful query parsers out of the box:
○ “Classic” Lucene:
■ df=title, q=I love search
→ title:i title:love title:search
○ “Swiss Army Knife” edismax:
■ qf=title body, q=I love search
→ +((title:i | body:i) (title:love | body:love) (title:search | body:search))
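The shape of the edismax output above can be reproduced with a toy model: every query term becomes one clause, and each clause is a disjunction of that term across the qf fields. The following is only a sketch of that shape. The class and method names are hypothetical, lowercasing stands in for real per-field analysis, and the actual edismax parser also handles mm, boosts, phrase fields, and more.

```java
import java.util.List;
import java.util.Locale;
import java.util.StringJoiner;

// Toy model of the query shape shown on the slide; NOT the real edismax implementation.
final class EdismaxShape {

    // Builds "+((f1:t1 | f2:t1) (f1:t2 | f2:t2) ...)" for the given fields and user query.
    static String expand(List<String> fields, String userQuery) {
        StringJoiner clauses = new StringJoiner(" ", "+(", ")");
        // Lowercasing is an assumed stand-in for the fields' analysis chains.
        for (String term : userQuery.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            StringJoiner perField = new StringJoiner(" | ", "(", ")");
            for (String field : fields) {
                perField.add(field + ":" + term);
            }
            clauses.add(perField.toString());
        }
        return clauses.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand(List.of("title", "body"), "I love search"));
    }
}
```

Calling expand with the slide's qf and q values yields the same nesting as the slide: one disjunction per term, all wrapped in a single required clause.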
7. Haystack 2019, April 24-25
How far can I go?
Search for the capitalized term “Green”, but not the adjective “green”, that is 5 positions or less before the noun “deal”.
{!lucene} "green deal"~5
{!surround} green 5w deal
{!surround} 5w(2w(green,deal), congress OR legislation)
_query_:"{!cap}firstcap(green)" AND _query_:"{!proximity}green 5w deal"
8. Haystack 2019, April 24-25
Query Parsers Composition
● Solr provides a large variety of QPs (28 and counting, plus the JSON Query DSL) that are composable:
_query_:"{!lucene}\"green deal\""
AND
_query_:"{!surround} 5n(congress, democrat)"
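When an application assembles such a composition, the inner double quotes of each sub-query have to be escaped before the sub-query is wrapped in _query_:"...". A minimal sketch of that string assembly follows; the class and method names are hypothetical, and only the _query_ and {!parser} syntax comes from the slide.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that nests sub-queries using Solr's _query_:"{!parser}..." syntax.
final class ComposedQuery {

    // Wraps one sub-query, escaping backslashes and inner double quotes.
    static String nested(String parserName, String subQuery) {
        String escaped = subQuery.replace("\\", "\\\\").replace("\"", "\\\"");
        return "_query_:\"{!" + parserName + "}" + escaped + "\"";
    }

    public static void main(String[] args) {
        String q = nested("lucene", "\"green deal\"")
                + " AND "
                + nested("surround", "5n(congress, democrat)");
        System.out.println(q);
        // URL-encode the composed query before placing it in the q request parameter.
        System.out.println("q=" + URLEncoder.encode(q, StandardCharsets.UTF_8));
    }
}
```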
11. Haystack 2019, April 24-25
What If We Need To Go Beyond?
● There are limitations and quirks, e.g., the Solr “Surround” QP:
○ Distance <= 99;
○ Search terms are not analyzed! What?
● What about operators that do not exist?
○ Capitalization: Match Green, but not green
○ Frequency: Must match N times or less
○ As-is: Search for a term as written.
● What do we do now? Enter the world of custom query parsers!
12. Haystack 2019, April 24-25
Demo: Let’s build a simple proximity query parser!
… CVille Haystack w5 Rocks 2019 ...
- Analyze terms
- Distance >= 0, no upper limit
- Operator: Same as surround (w<dist>, n<dist>)
- https://github.com/o19s/solr-query-parser-demo
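The requirements listed for the demo (analyzed terms, any non-negative distance, surround-style w<dist> and n<dist> operators) can be sketched without any Solr dependency. This is not the code from the demo repository. It is a hypothetical, standalone illustration of the parsing step only, in which lowercasing stands in for a real analysis chain and the distance is limited only by the int range.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model of a parsed "<term> w<dist> <term>" or "<term> n<dist> <term>" expression.
final class ProximityExpression {

    // w = ordered ("within"), n = unordered ("near"), as in the surround query parser.
    private static final Pattern SYNTAX =
            Pattern.compile("^\\s*(\\S+)\\s+([wn])(\\d+)\\s+(\\S+)\\s*$", Pattern.CASE_INSENSITIVE);

    final String left;
    final String right;
    final int distance;
    final boolean ordered;

    private ProximityExpression(String left, String right, int distance, boolean ordered) {
        this.left = left;
        this.right = right;
        this.distance = distance;
        this.ordered = ordered;
    }

    // Assumed stand-in for term analysis: lowercase only (no stemming, no stop words).
    static String analyze(String term) {
        return term.toLowerCase(Locale.ROOT);
    }

    static ProximityExpression parse(String input) {
        Matcher m = SYNTAX.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException(
                    "Expected '<term> w<dist> <term>' or '<term> n<dist> <term>': " + input);
        }
        boolean ordered = m.group(2).equalsIgnoreCase("w");
        int distance = Integer.parseInt(m.group(3));
        return new ProximityExpression(analyze(m.group(1)), analyze(m.group(4)), distance, ordered);
    }

    public static void main(String[] args) {
        ProximityExpression e = parse("Haystack w5 Rocks");
        System.out.println(e.left + " " + (e.ordered ? "w" : "n") + e.distance + " " + e.right);
    }
}
```

In a real plugin, the parsed pieces would then be turned into Lucene span queries rather than printed.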
13. Haystack 2019, April 24-25
Query Parser Plugin Anatomy
ProximityQParserPlugin.java (QP “Factory” Class):
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

public class ProximityQParserPlugin extends QParserPlugin {
    @Override
    public QParser createParser(String s, SolrParams localParams, SolrParams globalParams,
                                SolrQueryRequest solrQueryRequest) {
        // Factory method: returns a parser instance for the incoming query string.
        return new ProximityQParser(s, localParams, globalParams, solrQueryRequest);
    }
}
In solrconfig.xml (Solr Config):
<queryParser name="proximity" class="com.o19s.solr.qparser.ProximityQParserPlugin"/>
<requestHandler name="/proximity" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">proximity</str>
    ...
19. Haystack 2019, April 24-25
Query Parser Strategies
“Natural” Query Language
  Application search box: green deal
  Solr QP: edismax
  q={!edismax} green deal
Custom Query Language (Moderate Complexity)
  Application search box: dog near/5 house
  Solr QP: surround
  q={!surround} green 5n deal
Custom Query Language (Any Complexity)
  Application search box: cap(green) near/5 deal
  Solr QP: MyQP
  q={!myqp} cap(green) near/5 deal
20. Haystack 2019, April 24-25
Query Parser Strategies Comparison
Criteria                  | edismax              | Solr QPs Composition            | Custom QP
Software R&D              | No                   | Moderate                        | High
Ease of Solr upgrade      | Very Good            | Good                            | To be managed
Performance               | Good                 | Good                            | Better vs. Solr QPs composition. But be careful!
Deployment                | -                    | -                               | Plugin jar(s)
Ease of Relevancy Tuning  | The good ol’ edismax | Individual QPs’ knobs and dials | More software to write!
21. Haystack 2019, April 24-25
Entities Recognition vs. Query Parsing
[Architecture diagram. Components shown: Search Requests; Load Balancer; Search Service (multiple instances); Entities Recognition Service; Load Balancer; Solr Node running MyQP (multiple instances)]
22. Haystack 2019, April 24-25
Closing Remarks
● QPs are a lot of fun, BUT:
○ Make sure you really need to go beyond the out-of-the-box features!
○ Great power comes with great responsibility. Careful what you write!
○ Relevancy knobs and dials can be tricky to re-implement: Multi-field, term- vs. fields-centric, mm, field boosting, etc.
● The next frontier: Custom Lucene queries
○ Multi-terms synonyms w/ equalized scoring
○ Frequency operators
Editor's Notes
Good morning everyone.
My name is Bertrand Rigaldies.
I am an OpenSource Connections search consultant.
I joined OSC in early 2017.
I have worked primarily in Solr, working on a variety of search relevancy issues.
Lately, I have been very fortunate to work on a custom query parser for a great client, represented by some of you in the audience. A large part of this talk has been inspired by this work.
In this talk I would like to share with you my experience with Query Parsers: Why and when do we need them? How to write one? The different design and implementation options, their pros and cons, and some pitfalls.
This talk is definitely more engineering than science, more back-to-the-fundamentals than let’s-re-invent-search. It’s a let’s lift the hood and see the different nuts and bolts of Query Parsers. Hopefully the talk will give some ideas you can take back to your jobs.
Quick polling of the audience on the topic:
Raise your arm if you have no idea or only a vague idea of what a query parser is
Raise your arm if you have participated in the development of a custom query parser?
Raise your arm if you are a developer, and/or you’re comfortable with Java?
Some basics first. Let’s locate the Query Parser, as an architecture component, in the overall Solr (or ES) architecture.
This slide was borrowed and adapted from the OSC Solr training material.
Talking Points:
Where do query parsers belong in the big picture of a search engine?
Left side handles documents indexing
Right side handles querying
The concentric circles should be from center going out
Query goes through the following 4 concentric circles of processing:
Normalization of the search terms (tokenization, filters, etc.)
Matching and ranking, which is the responsibility of query parsers
Decoration, such as snippeting, highlighting, term vectors, spell checking
Analytics, such as facets
In this talk we’ll be focusing on the Matching and Ranking ring.
SLIDE CAPTION ONLY
So, what is the problem?
Well, with Query Parsers, we are staring at one of the core challenges of search engines: How to understand the text that the end-user typed or spoke, and turn it into code that can be executed to search.
Click NEXT
The first part of the problem is essentially the classic problem of compiling a high-level formalism (e.g., everyday English, or another, more formal kind of search expression) into low-level executable code.
For the most part, the first issue is well understood from a computer science standpoint. There are several great tools to generate compilers (JavaCC, ANTLR, etc.; note: the Lucene classic search language is implemented in JavaCC). And, as we’ll see, Lucene provides a rich set of search primitives that we can use to create executable search code.
CLICK NEXT
The second part of the problem is more challenging, and has to do with “query understanding”: What is the end-user saying, and what is the appropriate executable search construct? The Holy Grail of search is to search what the end-user means, not what he/she typed. Well, we haven’t invented a compiler that can do that yet! Ha ha.
Now, more practically for our applications design, we should ask ourselves how end-users will search. That understanding will inform how to parse the text they type or speak.
NEXT, NEXT, etc.
So-called “Natural” language, a la Google
Or, more formal languages:
Boolean and proximity like the Classic Lucene syntax
More advanced than the Classic, with operators like as-is, capitalization, clause frequency, etc.
Or, some kind of hybrid
So, to wrap up this context-setting slide: At a philosophical level, this PROBLEM may be the FIRST relevancy issue in a search application project: How do we translate the end-user’s high-level search expression into an executable that will most effectively approximate what the end-user is looking for?
ANIMATION!
What can we do in Solr (or ES) in order to address the problem?
Good news is: Solr offers powerful out-of-the-box Query Parsers.
The edismax is like a power tool: With it comes great responsibility. And Doug wrote three chapters on the subtleties and pros and cons of multi-field queries, and the pros and cons of terms- vs. fields-centric approaches. I spent a year tuning a system using the edismax with many fields. It’s hard work! Which requires a mature relevancy testing infrastructure by the way, but you knew that.
Ask the audience who has been using the edismax in their applications?
There is a very rich query parsers eco-system in Solr: Solr 7.7 Other Query Parsers
But, how far can I go with the Solr query parsers?
Pretty far actually!
For example, for queries that specify some proximity between terms, there is a little-known query parser called the “surround” query parser. Check it out in the Solr doc. It’s implemented in Lucene by a separate JavaCC grammar (see the package org.apache.lucene.queryparser.surround.parser).
Solr demo: http://localhost:8983/solr/demo/select?debugQuery=on&df=title_t&q=%7B!surround%7D%205n(2n(donald%2Ctrump)%2C%20impeached%20OR%20impeachment)&wt=json
TODO: Change to an example that is not (too) political
Green legislation
Search for the capitalized term “Green” (as if Green New Deal), but not the color “green”, which is within X positions of the term “legislation”
cap(green) w/5 legislation
Note: the position count in the surround operator is “slop + 1” (Number of positions between terms + 1)
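To make the “slop + 1” note concrete: a surround distance of 5 corresponds to a slop of 4, and a distance of 1 (adjacent terms) to a slop of 0. A tiny sketch of the conversion, with a hypothetical helper name:

```java
// Hypothetical helper illustrating the "position count = slop + 1" relationship from the note.
final class SurroundDistance {

    // Converts a surround-style position count (1 = adjacent terms) into a slop value.
    static int toSlop(int positionCount) {
        if (positionCount < 1) {
            throw new IllegalArgumentException("Position count must be at least 1: " + positionCount);
        }
        return positionCount - 1;
    }

    public static void main(String[] args) {
        System.out.println("5w -> slop " + toSlop(5));
        System.out.println("1w -> slop " + toSlop(1));
    }
}
```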
So, the Solr toolbox provides many query parsers that can be combined in arbitrarily complex compositions.
Solr demo: http://localhost:8983/solr/demo/select?debugQuery=on&df=title_t&q=_query_%3A%22%7B!lucene%7D%5C%22green%20deal%5C%22%22%20%0AAND%20%0A_query_%3A%22%7B!surround%7D%205n(congress%2C%20democrat)%22&wt=json
Example of QPs composition with XML.
Show on Postman > Haystack 2019 > XML Query Parser
Demo: Postman
Example of QPs composition with ES-esque JSON Query DSL
Show on Postman > Haystack 2019 > JSON Query Parser
Demo
ANIMATION!
NEXT: But there are limitations that could be showstoppers to meet your functional requirements.
E.g., in the surround QP
NEXT: What about operators that do not exist in Lucene or Solr?
NEXT: Enter the world of custom query parsers...
Show, explain, and run the code from IntelliJ.
We’re going to improve the surround QP in two areas:
Analyze the search terms
Not have any distance limitation
Quick Query Parser anatomy with a couple of slides and then we’ll do a high-level code walk through.
This is Java code, so hopefully we’re okay with that. My apologies to those in the room who will have a hard time seeing the code. But one takeaway for all should be that there is an easy and convenient set of Solr plugin patterns, as well as Lucene search primitives, that make the creation of custom query parsers very approachable for you all.
Show request handler in IntelliJ.
Note that the query parser does not execute the query. Its sole responsibility is to produce the executable query and return it to Solr.
Good news is that, as a query parser writer, we don’t have to worry about query execution concerns such as: filtering, pagination, highlights, facets, boosting (more on that later).
Let’s walk through the code (next slide).
Go to the Solr UI to play with our proximity query parser
FIRST, show the plugin in action:
Query with it: {!proximity} GREEN w5 Deal http://localhost:8983/solr/demo/select?debugQuery=on&fl=*&q=%7B!proximity%7D%20GREEN%20w5%20Deal&qf=title_t&wt=json
Analyzed search terms (Donald and IMPEACHED analyzed to donald and impeached)
No limit to the distance (100)
Show the generated Lucene query with debugQuery=on
Show where the jar is deployed (Normally pushed to an artifacts repository, and deployed to your Solr nodes)
Show the plugin is listed: http://localhost:8983/solr/#/demo/plugins?type=queryparser&entry=com.o19s.solr.qparser.ProximityQParserPlugin
Query using the request handler /proximity:
No match (mm=100, “democrat” does not match because we have no stemmer): http://localhost:8983/solr/demo/proximity?=&debugQuery=on&fl=*&mm=100&q=green%20new%20w5%20deal%20democrat&qf=title_t&wt=json
mm=50%: http://localhost:8983/solr/demo/proximity?=&debugQuery=on&fl=*&mm=50&q=green%20new%20w5%20deal%20democrat&qf=title_t&wt=json
Show Boosting, Highlights, Sort, Facets
SCORING? You get the behavior of the underlying Lucene primitives and how they are composed.
Go to Splainer live: http://splainer.io/#?solr=http:%2F%2Flocalhost:8983%2Fsolr%2Fdemo%2Fproximity%3FdebugQuery%3Don%26q%3DDonald%20w100%20IMPEACHED%26qf%3Dtitle_t&fieldSpec=*
Just showing that the scoring is provided by the underlying Lucene primitives.
TIDBITS:
IDF(phrase) = sum of the IDFs of the phrase’s terms
The span’s “phrase frequency” is 1 / (distance + 1)
The span frequency (0.333…) is calculated in the BM25Similarity class (See line 77 in the 7_7 branch):
protected float sloppyFreq(int distance) {
return 1.0f / (distance + 1);
}
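As a quick worked example of the formula quoted above: a distance of 2 gives 1 / (2 + 1) = 0.333…, which matches the span frequency mentioned in the notes, and an exact match (distance 0) gives 1. The sketch below copies only that one-line formula into a standalone class with a hypothetical name, so it runs without Lucene.

```java
// Standalone copy of the sloppyFreq formula quoted in the notes, for experimentation only.
final class SloppyFreqDemo {

    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        // The contribution of a span match shrinks as the matched terms get farther apart.
        for (int distance : new int[] {0, 1, 2, 5, 100}) {
            System.out.println("distance=" + distance + " -> sloppyFreq=" + sloppyFreq(distance));
        }
    }
}
```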
Can the score be customized? Yes, but that involves peeling the next layer of the Lucene onion and getting into the Weight and Scorer classes. A presentation for another time.
Also, show Quepid is fine.
Recap: Different approaches to QPs.
Notice the location of the Query Parser in the center solution: It is in the application! The application parses the end-user’s string and produces a Solr search using the Perl-like notation, XML, or JSON.
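A minimal sketch of that application-side parsing, assuming a hypothetical end-user syntax of the form <term> near/<dist> <term> and a translation to the surround parser's unordered n operator. The class name is invented for illustration. The terms are lowercased in the application because, as noted earlier, the surround parser does not analyze search terms.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical application-side translator: "dog near/5 house" -> "{!surround} dog 5n house".
final class NearTranslator {

    private static final Pattern NEAR =
            Pattern.compile("^\\s*(\\w+)\\s+near/(\\d+)\\s+(\\w+)\\s*$", Pattern.CASE_INSENSITIVE);

    static String toSurround(String userQuery) {
        Matcher m = NEAR.matcher(userQuery);
        if (!m.matches()) {
            throw new IllegalArgumentException("Expected '<term> near/<dist> <term>': " + userQuery);
        }
        // Lowercase here because the surround parser will not analyze the terms for us.
        String left = m.group(1).toLowerCase(Locale.ROOT);
        String right = m.group(3).toLowerCase(Locale.ROOT);
        return "{!surround} " + left + " " + m.group(2) + "n " + right;
    }

    public static void main(String[] args) {
        System.out.println("q=" + toSurround("dog near/5 house"));
    }
}
```

The trade-off is the one in the comparison table: nothing has to be deployed to Solr, but the custom syntax is limited to what the underlying Solr parsers can express.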
Ease of Relevance Tuning:
Edismax: the good, bad, and ugly
QPs composition: Underlying QPs’ knobs and dials
Custom Query Parser: More software to write:
Multi-fields
Synonyms
Compounds
DisMax behavior
Terms- vs fields-centric
Tie
mm
etc.
Often times, entities such as dates, numbers, people, places, institutions, etc. must be recognized as a first-pass before producing the parse tree so that the recognized entities are leaves themselves.
In a large SolrCloud cluster, if the entity recognition is an expensive operation, perhaps involving a call to an external service, it is a good idea to not have Solr be responsible for entities recognition, and let the application layer handle it.