In this talk we’ll cover the basics of search relevancy in Elasticsearch, from how relevancy is calculated and modeled to modifying query structure, setting up analyzer chains, and measuring incremental improvements. The talk will highlight several real-world relevancy scenarios encountered in the consulting work at KMW Technology, a leading provider of search professional services to major organizations.
Building a Real-time Solr-powered Recommendation Engine | lucenerevolution
Presented by Trey Grainger | CareerBuilder - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
Searching text is what Solr is known for, but did you know that many companies receive an equal or greater business impact through implementing a recommendation engine in addition to their text search capabilities? With a few tweaks, Solr (or Lucene) can also serve as a full featured recommendation engine. Machine learning libraries like Apache Mahout provide excellent behavior-based, off-line recommendation algorithms, but what if you want more control? This talk will demonstrate how to effectively utilize Solr to perform collaborative filtering (users who liked this also liked…), categorical classification and subsequent hierarchical-based recommendations, as well as related-concept extraction and concept based recommendations. Sound difficult? It’s not. Come learn step-by-step how to create a powerful real-time recommendation engine using Apache Solr and see some real-world examples of some of these strategies in action.
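The "users who liked this also liked…" pattern described above reduces to counting item co-occurrences across users. A minimal, Solr-free sketch in Python (the interaction data and item ids here are invented purely for illustration):

```python
from collections import Counter

# Hypothetical interaction data: user -> set of liked item ids.
likes = {
    "u1": {"solr-in-action", "lucene-guide", "mahout-cookbook"},
    "u2": {"solr-in-action", "lucene-guide"},
    "u3": {"solr-in-action", "es-definitive-guide"},
    "u4": {"lucene-guide", "es-definitive-guide"},
}

def also_liked(item, likes):
    """Rank other items by how often they co-occur with `item` across users."""
    counts = Counter()
    for user_items in likes.values():
        if item in user_items:
            for other in user_items - {item}:
                counts[other] += 1
    return counts.most_common()

print(also_liked("solr-in-action", likes))
```

In a real Solr-backed engine the same idea is expressed as a query against a "likes" field rather than an in-memory loop, but the ranking principle is the same.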
1. LDA represents documents as mixtures of topics and topics as mixtures of words.
2. It assumes documents are generated by first choosing a topic distribution, then choosing words from that topic.
3. The algorithm estimates topic distributions for each document and word distributions for each topic that are most likely to have generated the observed document-word matrix.
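The generative story in point 2 can be made concrete in a few lines of Python; the topic names, word distributions, and mixture weights below are invented for illustration (a real LDA model would estimate them from data, as point 3 describes):

```python
import random

random.seed(7)

# Hypothetical topics: each is a probability distribution over words.
topics = {
    "sports": {"ball": 0.5, "team": 0.3, "score": 0.2},
    "tech":   {"code": 0.4, "data": 0.4, "search": 0.2},
}

def generate_document(topic_mixture, n_words):
    """LDA's generative story: for each word, pick a topic from the
    document's topic mixture, then pick a word from that topic."""
    doc = []
    for _ in range(n_words):
        topic = random.choices(list(topic_mixture),
                               weights=list(topic_mixture.values()))[0]
        word_dist = topics[topic]
        word = random.choices(list(word_dist),
                              weights=list(word_dist.values()))[0]
        doc.append(word)
    return doc

# A document that is 70% "tech" and 30% "sports":
print(generate_document({"tech": 0.7, "sports": 0.3}, 10))
```

Inference (point 3) runs this story in reverse: given only the generated words, estimate the mixtures most likely to have produced them.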
Audio available: https://www.liferay.com/web/events-symposium-north-america/recap
Liferay makes it easy to integrate your application with powerful search engines. However, it may be hard to diagnose why your most important content isn't showing up the way you need it to. This session will recap the key concepts for indexing and querying with Liferay Search, and present a number of techniques to guarantee your documents will be found with the best possible relevance.
André de Oliveira joined Liferay in early 2014 as a senior engineer and leads the Search Infrastructure team. He's been a Java developer and architect for the last 15 years. Ever since discovering Elasticsearch, he's vowed never to write another SQL WHERE clause again.
The core Search frameworks in Liferay 7 have been significantly retooled to benefit not only from Liferay's new modular architecture, but also from one of the most innovative players in the market: Elasticsearch, which replaces Lucene as the default search engine in Portal. This session will cover topics like clustering and scalability, unveil improvements for both Elasticsearch and Solr (aggregations, filters, geolocation, "more like this", and other new query types), and showcase hot new features for the Enterprise like out-of-the-box Marvel cluster monitoring and Shield security.
André "Arbo" Oliveira joined Liferay in early 2014 as a senior engineer and leads the Search Infrastructure team. He's been writing code for a living for 22 years, 14 of them as a Java developer and architect. Ever since discovering Elasticsearch, he's vowed never to write another SQL WHERE clause again.
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine | Trey Grainger
Search engines frequently miss the mark when it comes to understanding user intent. This talk will describe how to overcome this by leveraging Lucene/Solr to power a knowledge graph that can extract phrases, understand and weight the semantic relationships between those phrases and known entities, and expand the query to include those additional conceptual relationships. For example, if a user types in (Senior Java Developer Portland, OR Hadoop), you or I know that the term “senior” designates an experience level, that “java developer” is a job title related to “software engineering”, that “portland, or” is a city with a specific geographical boundary, and that “hadoop” is a technology related to terms like “hbase”, “hive”, and “map/reduce”. Out of the box, however, most search engines just parse this query as text:((senior AND java AND developer AND portland) OR (hadoop)), which is not at all what the user intended. We will discuss how to train the search engine to parse the query into this intended understanding, and how to reflect this understanding to the end user to provide an insightful, augmented search experience. Topics: Semantic Search, Finite State Transducers, Probabilistic Parsing, Bayes Theorem, Augmented Search, Recommendations, NLP, Knowledge Graphs
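The entity tagging described above can be approximated with a greedy longest-match lookup over query n-grams. The tiny "knowledge graph" dictionary below is hand-built for illustration; the talk's approach builds it from real data and uses finite state transducers for efficient matching:

```python
# Hypothetical entity dictionary: surface form -> (type, related concept).
entities = {
    "senior": ("experience_level", "senior"),
    "java developer": ("job_title", "software engineering"),
    "portland, or": ("city", "Portland, OR"),
    "hadoop": ("technology", "hadoop"),
}

def tag_query(query, entities, max_ngram=3):
    """Greedy longest-match tagging of known phrases in a query."""
    words = query.lower().split()
    tags, i = [], 0
    while i < len(words):
        for n in range(max_ngram, 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in entities:
                tags.append((phrase, *entities[phrase]))
                i += n
                break
        else:  # no known phrase starts here; fall back to a plain keyword
            tags.append((words[i], "keyword", words[i]))
            i += 1
    return tags

print(tag_query("senior java developer portland, or hadoop", entities))
```

With the query tagged this way, the engine can build the structured, intent-aware query the talk describes instead of a bag-of-words boolean.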
Enhancing relevancy through personalization & semantic search | lucenerevolution
I. The document discusses how CareerBuilder uses Solr for search at scale, handling over 1 billion documents and 1 million searches per hour across 300 servers.
II. It then covers traditional relevancy scoring in Solr, which is based on TF-IDF, as well as ways to boost documents, fields, and terms.
III. Advanced relevancy techniques are described, including using custom functions to incorporate domain-specific knowledge into scoring, and context-aware weighting of relevancy parameters. Personalization and recommendation approaches are also summarized, including attribute-based and collaborative filtering methods.
Enhancing relevancy through personalization & semantic search | Trey Grainger
Matching keywords is just step one in the effort to maximize the relevancy of your search platform. In this talk, you'll learn how to implement advanced relevancy techniques which enable your search platform to "learn" from your content and users' behavior. Topics will include automatic synonym discovery, latent semantic indexing, payload scoring, document-to-document searching, foreground vs. background corpus analysis for interesting term extraction, collaborative filtering, and mining user behavior to drive geographically and conceptually personalized search results. You'll learn how CareerBuilder has enhanced Solr (also utilizing Hadoop) to dynamically discover relationships between data and behavior, and how you can implement similar techniques to greatly enhance the relevancy of your search platform.
Thought Vectors and Knowledge Graphs in AI-powered Search | Trey Grainger
While traditional keyword search is still useful, pure text-based keyword matching is quickly becoming obsolete; today, it is a necessary but not sufficient tool for delivering relevant results and intelligent search experiences.
In this talk, we'll cover some of the emerging trends in AI-powered search, including the use of thought vectors (multi-level vector embeddings) and semantic knowledge graphs to contextually interpret and conceptualize queries. We'll walk through some live query interpretation demos to demonstrate the power that can be delivered through these semantic search techniques leveraging auto-generated knowledge graphs learned from your content and user interactions.
This document provides best practices and tips for implementing the +1 button on websites. It discusses why the +1 button is useful for increasing engagement and search traffic. The basics of integrating the JavaScript and HTML for the +1 button are covered, as well as best practices like placing buttons near sharable content and using canonical URLs. Advanced options for button sizes, parameters, and explicit loading are also summarized. Lastly, upcoming social analytics tools in Google Webmaster Tools and Google Analytics for measuring +1 button performance are outlined.
Semantic & Multilingual Strategies in Lucene/Solr | Trey Grainger
When searching on text, choosing the right CharFilters, Tokenizer, stemmers, and other TokenFilters for each supported language is critical. Additional tools of the trade include language detection through UpdateRequestProcessors, parts of speech analysis, entity extraction, stopword and synonym lists, relevancy differentiation for exact vs. stemmed vs. conceptual matches, and identification of statistically interesting phrases per language. For multilingual search, you also need to choose between several strategies such as: searching across multiple fields, using a separate collection per language combination, or combining multiple languages in a single field (custom code is required for this and will be open sourced). These all have their own strengths and weaknesses depending upon your use case. This talk will provide a tutorial (with code examples) on how to pull off each of these strategies as well as compare and contrast the different kinds of stemmers, review the precision/recall impact of stemming vs. lemmatization, and describe some techniques for extracting meaningful relationships between terms to power a semantic search experience per-language. Come learn how to build an excellent semantic and multilingual search system using the best tools and techniques Lucene/Solr has to offer!
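An analyzer chain of the kind discussed is simply a pipeline: character filters, then a tokenizer, then token filters. A toy English chain in Python, with a deliberately naive suffix-stripping stemmer standing in (for illustration only) for a real per-language stemmer like Porter or Snowball:

```python
import re

def char_filter(text):
    # Character-level normalization before tokenization.
    return text.lower()

def tokenize(text):
    # Split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text)

STOPWORDS = {"a", "an", "the", "of", "and"}

def stopword_filter(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def naive_stem(token):
    # Toy stand-in for a real stemmer; strips a few common suffixes.
    for suffix in ("ing", "ers", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    return [naive_stem(t) for t in stopword_filter(tokenize(char_filter(text)))]

print(analyze("The Searchers were searching"))
```

Because both the indexed text and the query pass through the same chain, "Searchers" and "searching" reduce to the same term and match each other; choosing the right components per language is exactly the craft the talk covers.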
A Multifaceted Look At Faceting - Ted Sullivan, Lucidworks
This document discusses using facets in Solr to facilitate relevant search. It provides an overview of facet history and how facets represent metadata that provides context about search results. Facets can be used for visualization, analytics, and understanding language semantics from text. The document argues that facets are dynamic context discovery tools that can be leveraged to find similar items and enhance search in various ways such as query autofiltering, typeahead suggestions, and text analytics.
Exploring Direct Concept Search - Steve Rowe, Lucidworks
This document discusses direct concept search using word embeddings. It describes mapping query and index terms to vector representations in a conceptual space to improve recall by expanding queries with related concepts. Word2vec is used to generate 127-dimensional word embeddings from Wikipedia text. The embeddings are indexed in Lucene to enable nearest neighbor search. Queries are expanded by searching for terms nearest to query terms in the embedding space. While building high-dimensional point indexes is slow in Lucene, this approach demonstrates the potential of using word embeddings for query expansion in information retrieval.
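The nearest-neighbor query expansion described can be sketched with plain cosine similarity. The 4-dimensional "embeddings" below are made up for illustration; real ones would be the word2vec vectors the talk describes (and Lucene would do the nearest-neighbor search at index scale):

```python
import math

# Toy word "embeddings" (real ones would be learned by word2vec).
vecs = {
    "java":   [0.9, 0.1, 0.0, 0.2],
    "scala":  [0.8, 0.2, 0.1, 0.2],
    "hadoop": [0.1, 0.9, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def expand(term, vecs, k=1, min_sim=0.8):
    """Expand a query term with its nearest neighbors in embedding space."""
    sims = sorted(
        ((cosine(vecs[term], v), w) for w, v in vecs.items() if w != term),
        reverse=True,
    )
    return [term] + [w for s, w in sims[:k] if s >= min_sim]

print(expand("java", vecs))
```

The `min_sim` threshold matters in practice: expanding with weakly related neighbors improves recall but can badly hurt precision.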
PHASE (Philly Area Scala Enthusiasts) - Word2vec in Scala. The talk explains, with concrete examples, how Word2vec works, built around a demo of constructing email alerts using concept search.
The document discusses various techniques for summarizing search results and detecting duplicate web pages. It describes static and dynamic summaries, where static summaries are always the same regardless of the query and dynamic summaries are query-dependent. It also covers different methods for generating static and dynamic summaries, as well as challenges in producing good dynamic summaries. The document then discusses various spam techniques used by search engine optimizers and the arms race between SEOs and search engines. It concludes by outlining approaches for detecting near-duplicate and mirrored web pages.
A quick overview of Elasticsearch usage at Dailymotion for video search
Talk given at Elasticsearch Meetup France #7
June 10, 2014
http://www.meetup.com/elasticsearchfr/events/171946592/
This document discusses sorting and relevance in Elasticsearch. It provides examples of sorting search results by date or score. It also covers multilevel sorting, sorting on multivalue fields, and sorting on string fields with analyzed versus non-analyzed text. The document explains what determines relevance in Elasticsearch, including term frequency, inverse document frequency, and field length norm. It shows how to get explain plans and failure messages for queries. Finally, it provides a brief introduction to doc values in Elasticsearch and references a book for further information.
Elasticsearch document summarization:
- Elasticsearch calculates relevance scores for documents matching a query using an algorithm that combines term frequency, inverse document frequency, field length norm, and other factors.
- It analyzes queries and documents, finds matching documents, then scores each match; higher scores indicate more relevant matches.
- The Explain API can show how Elasticsearch calculated relevance for a specific query and document, breaking down factors like TF, IDF, and coordination, which helps in understanding and tuning search effectiveness.
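The factors in these bullets combine roughly as in Lucene's classic TF-IDF similarity, which Elasticsearch used by default before BM25. A simplified per-term score, ignoring the coordination and query-norm factors (the exact formula is in the Explain API output):

```python
import math

def classic_score(term_freq, doc_freq, num_docs, field_length):
    """Simplified Lucene classic similarity: tf * idf^2 * fieldNorm."""
    tf = math.sqrt(term_freq)                    # diminishing returns on repeats
    idf = 1 + math.log(num_docs / (doc_freq + 1))  # rare terms weigh more
    norm = 1 / math.sqrt(field_length)           # short fields weigh more
    return tf * idf * idf * norm

# A rare term in a short field outscores a common term in a long one:
rare = classic_score(term_freq=1, doc_freq=5, num_docs=1000, field_length=10)
common = classic_score(term_freq=1, doc_freq=500, num_docs=1000, field_length=100)
print(rare > common)
```

Seeing the three components separated this way makes the Explain API's breakdown much easier to read.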
This document provides an overview of Elasticsearch, including what it is, how it works, and how to perform basic operations like indexing, updating, and searching documents. It explains that Elasticsearch allows for advanced search across large amounts of data by making documents searchable and scaling easily. It also demonstrates how to index, update, search for, and retrieve documents through RESTful API calls. Faceted search, aggregations, and cluster architecture are also summarized.
The document describes how Sphinx, an open source full-text search engine, was used to optimize searching and reporting on a large dataset of over 160 million cross-links. The data was partitioned across 8 servers each with 4 Sphinx instances and 2 indexes. Queries were run in parallel across the instances to return results faster than could be achieved with a single database, with average query times under 0.125 seconds and 95% of queries returning under 0.352 seconds. The document outlines the partitioning, indexing, and querying approach used to optimize performance for the dataset.
Elasticsearch is an open-source, distributed, real-time document indexer with support for online analytics. It has features like a powerful REST API, schema-less data model, full distribution and high availability, and advanced search capabilities. Documents are indexed into indexes which contain mappings and types. Queries retrieve matching documents from indexes. Analysis converts text into searchable terms using tokenizers, filters, and analyzers. Documents are distributed across shards and replicas for scalability and fault tolerance. The REST APIs can be used to index, search, and inspect the cluster.
Introduction to Search Systems - ScaleConf Colombia 2017 | Toria Gibbs
Often when a new user arrives on your website, the first place they go to find information is the search box! Whether they are searching for hotels on your travel site, products on your e-commerce site, or friends to connect with on your social media site, it is important to have fast, effective search in order to engage the user.
Scaling Recommendations, Semantic Search, & Data Analytics with Solr | Trey Grainger
This presentation is from the inaugural Atlanta Solr Meetup held on 2014/10/21 at Atlanta Tech Village.
Description: CareerBuilder uses Solr to power their recommendation engine, semantic search, and data analytics products. They maintain an infrastructure of hundreds of Solr servers, holding over a billion documents and serving over a million queries an hour across thousands of unique search indexes. Come learn how CareerBuilder has integrated Solr into their technology platform (with assistance from Hadoop, Cassandra, and RabbitMQ) and walk through api and code examples to see how you can use Solr to implement your own real-time recommendation engine, semantic search, and data analytics solutions.
Speaker: Trey Grainger is the Director of Engineering for Search & Analytics at CareerBuilder.com and is the co-author of Solr in Action (2014, Manning Publications), the comprehensive example-driven guide to Apache Solr. His search experience includes handling multi-lingual content across dozens of markets/languages, machine learning, semantic search, big data analytics, customized Lucene/Solr scoring models, data mining and recommendation systems. Trey is also the Founder of Celiaccess.com, a gluten-free search engine, and is a frequent speaker at Lucene and Solr-related conferences.
This talk from TYPO3 DevDays 2015 in Nuremberg explains how search works under the hood: the TF-IDF algorithm, the vector space model, and how they are used in Lucene and therefore in Solr and Elasticsearch.
This document discusses search interfaces and principles. It begins with an introduction to the presenter and then covers topics like how search engines work, principles of good search design, and common front-end search patterns. Specific concepts discussed include indexing text, query analysis, scoring and ranking documents, filtering results, aggregations, autocomplete, highlighting search terms, and loading more results. The overall message is that search provides a powerful and flexible way to return relevant content to users.
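Of the front-end patterns listed, autocomplete is the easiest to sketch: keep the term dictionary sorted and binary-search for the prefix, as a search index's term dictionary effectively allows. The term list here is invented for illustration:

```python
import bisect

# Hypothetical sorted term dictionary, as an index might store it.
terms = sorted(["search", "searching", "seattle", "solr", "sorting", "spark"])

def autocomplete(prefix, terms, limit=5):
    """Binary-search the sorted dictionary for terms starting with `prefix`."""
    i = bisect.bisect_left(terms, prefix)
    out = []
    while i < len(terms) and terms[i].startswith(prefix) and len(out) < limit:
        out.append(terms[i])
        i += 1
    return out

print(autocomplete("sea", terms))
```

Production suggesters typically add popularity weighting and fuzzy matching on top of this basic prefix walk.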
This document discusses using Elasticsearch for social media analytics and provides examples of common tasks. It introduces Elasticsearch basics like installation, indexing documents, and searching. It also covers more advanced topics like mapping types, facets for aggregations, analyzers, nested and parent/child relations between documents. The document concludes with recommendations on data design, suggesting indexing strategies for different use cases like per user, single index, or partitioning by time range.
Declarative Multilingual Information Extraction with SystemT | diannepatricia
"Declarative Multilingual Information Extraction with SystemT" presented by Laura Chiticariu, IBM Research - Almaden as part of the Cognitive Systems Institute Speaker Series.
There are many examples of text-based documents, all in electronic format: e-mails, corporate Web pages, customer surveys, résumés, medical records, DNA sequences, technical papers, incident reports, news stories, and more. There is not enough time or patience to read them all, and some (e.g., DNA sequences) are hard to comprehend. Can we extract the most vital kernels of information? We wish to find a way to gain knowledge, in summarised form, from all that text, without reading or examining it fully first.
This document discusses search interfaces and principles. It begins by introducing Daniel Beach and his work in search. It then covers general search principles like using search as a conversation with users and focusing on relevance over design. Various search techniques are explained, including indexing, query analysis, result scoring, filtering, aggregations, autocomplete, highlighting and loading more results. The document emphasizes that search provides flexibility to return relevant content given user inputs.
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query Pitfalls | MongoDB
Query performance can either be a constant headache or the unsung hero of an application. MongoDB provides extremely powerful querying capabilities when used properly. As a member of the solutions architecture team, I will share common mistakes observed as well as tips and tricks for avoiding them.
This document discusses various techniques for indexing and ranking documents in information retrieval systems like search engines. It covers vector space models that represent documents as vectors of words, tf-idf weighting to measure word importance, cosine similarity for comparing document-query vectors, and PageRank for ranking pages based on link analysis and simulating random walks through the link graph. Modern search engines use learning to rank approaches that combine these factors and user click data to personalize rankings for different users and contexts.
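The "random walk" intuition behind PageRank in the summary above can be sketched as a short power iteration; the four-page link graph is invented for illustration:

```python
# Hypothetical link graph: page -> pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over the link graph, simulating a random surfer
    who follows a link with probability `damping` and otherwise jumps
    to a random page."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new[target] += share
        rank = new
    return rank

ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page "c" ends up on top because three of the four pages link to it; real implementations also handle dangling pages (no outlinks), which this sketch omits.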
Scaling Recommendations, Semantic Search, & Data Analytics with solrTrey Grainger
This presentation is from the inaugural Atlanta Solr Meetup held on 2014/10/21 at Atlanta Tech Village.
Description: CareerBuilder uses Solr to power their recommendation engine, semantic search, and data analytics products. They maintain an infrastructure of hundreds of Solr servers, holding over a billion documents and serving over a million queries an hour across thousands of unique search indexes. Come learn how CareerBuilder has integrated Solr into their technology platform (with assistance from Hadoop, Cassandra, and RabbitMQ) and walk through api and code examples to see how you can use Solr to implement your own real-time recommendation engine, semantic search, and data analytics solutions.
Speaker: Trey Grainger is the Director of Engineering for Search & Analytics at CareerBuilder.com and is the co-author of Solr in Action (2014, Manning Publications), the comprehensive example-driven guide to Apache Solr. His search experience includes handling multi-lingual content across dozens of markets/languages, machine learning, semantic search, big data analytics, customized Lucene/Solr scoring models, data mining and recommendation systems. Trey is also the Founder of Celiaccess.com, a gluten-free search engine, and is a frequent speaker at Lucene and Solr-related conferences.
The talk at TYPO3 DevDays 2015 in Nuremberg which explains the deep insights of how search works. TF-IDF algorithm, vector space model and how that is used in Lucene and therefore Solr and Elasticsearch.
This document discusses search interfaces and principles. It begins with an introduction to the presenter and then covers topics like how search engines work, principles of good search design, and common front-end search patterns. Specific concepts discussed include indexing text, query analysis, scoring and ranking documents, filtering results, aggregations, autocomplete, highlighting search terms, and loading more results. The overall message is that search provides a powerful and flexible way to return relevant content to users.
This document discusses using Elasticsearch for social media analytics and provides examples of common tasks. It introduces Elasticsearch basics like installation, indexing documents, and searching. It also covers more advanced topics like mapping types, facets for aggregations, analyzers, nested and parent/child relations between documents. The document concludes with recommendations on data design, suggesting indexing strategies for different use cases like per user, single index, or partitioning by time range.
Declarative Multilingual Information Extraction with SystemTdiannepatricia
"Declarative Multilingual Information Extraction with SystemT" presented by Laura Chiticariu, IBM Research - Almaden as part of the Cognitive Systems Institute Speaker Series.
There are many examples of text-based documents (all in ‘electronic’ format…)
e-mails, corporate Web pages, customer surveys, résumés, medical records, DNA sequences, technical papers, incident reports, news stories and more…
Not enough time or patience to read
Can we extract the most vital kernels of information…
So, we wish to find a way to gain knowledge (in summarised form) from all that text, without reading or examining them fully first…!
Some others (e.g. DNA seq.) are hard to comprehend!
This document discusses search interfaces and principles. It begins by introducing Daniel Beach and his work in search. It then covers general search principles like using search as a conversation with users and focusing on relevance over design. Various search techniques are explained, including indexing, query analysis, result scoring, filtering, aggregations, autocomplete, highlighting and loading more results. The document emphasizes that search provides flexibility to return relevant content given user inputs.
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query PitfallsMongoDB
Query performance can either be a constant headache or the unsung hero of an application. MongoDB provides extremely powerful querying capabilities when used properly. As a member of the solutions architecture team, I will share common mistakes observed as well as tips and tricks to avoiding them.
This document discusses various techniques for indexing and ranking documents in information retrieval systems like search engines. It covers vector space models that represent documents as vectors of words, tf-idf weighting to measure word importance, cosine similarity for comparing document-query vectors, and PageRank for ranking pages based on link analysis and simulating random walks through the link graph. Modern search engines use learning to rank approaches that combine these factors and user click data to personalize rankings for different users and contexts.
Describes techniques for injecting "Semantic Intelligence" into search applications. Focuses on Apache Solr and Lucidworks Fusion, but these techniques are generally applicable to any search engine because all of them use the same basic mechanism - inverted token mapping at their 'core'.
Radu Gheorghe gives an introduction to Solr, an open source search engine based on Apache Lucene. He discusses when Solr would be used, such as for product search, as well as when it may not be suitable, such as for sparse data. The presentation covers how Solr works with inverted indexes and scoring documents, as well as features like facets, streaming aggregations, master-slave and SolrCloud architectures. A demo is offered to illustrate Solr functionality.
Elasticsearch is an open source search engine based on Lucene. It allows for distributed, highly available, and real-time search and analytics of documents. Documents are indexed and stored across multiple nodes in a cluster, with the ability to scale horizontally by adding more nodes. Elasticsearch uses an inverted index to allow fast full-text searches of documents.
A neural network is a machine learning program, or model, that makes decisions in a manner similar to the human brain, by using processes that mimic the way biological neurons work together to identify phenomena, weigh options and arrive at conclusions.
2. Outline
• Intro to Relevance
• Crash Course: Scoring
• Relevance Tuning Case Study
• Testing Relevance
• Discussion
3. What is Relevance?
• A subjective measure of how useful a document is to a user
who searched for something
• Does it satisfy the user’s information need?
• If I search for “cats”…
• Probably relevant: the movie “Cats,” the stage musical
“Cats,” cat pictures, cat blogs, cat food, Felis catus
• Vaguely relevant: dogs
• Not relevant: CAT scanners, catsup, cement mixers
5. What Is Relevance Tuning?
Adjusting the content of search results so that the most
relevant documents are included
Adjusting the order of search results so that the most
relevant results appear on top
7. Why Tune Relevance?
• FANTASY: “Once I get the data into my search engine, it
does all the work of finding the best matches for my
queries.”
• TRUTH: “We have to configure the search engine to rank
results in a way that is meaningful to the user.”
8. Search Engine Doesn't
Know…
• Which fields are important
• How users will search those fields
• Which query terms are the most significant
• Whether term order is significant
• Which terms mean the same thing
• What priorities the user has based on location, season, task, etc.
• What priorities the provider has re: sales, promotions, sponsorships, etc.
• Whether freshness, popularity, ratings are important
9. Relevance Problems
• Search for “Rocky” returns “Rocky Road To Dublin”
before the movie “Rocky”
• Search in MA for “coffee” returns “Coffee Day” (chain in
India) before “George Howell Coffee”
• Search for product by SKU returns permutations
• Search for “bikes” fails to find “bicycle”
• Search for “The The” (band) returns no results
10. Precision and Recall
• High Precision: “Everything I see is useful to me”
• High Recall: “Everything I might want is included”
• Relevance tuning is a tradeoff between precision and
recall
11. Precision And Recall
• Precision = Relevant Results / All Results
• “Only 5 out of 10 results returned were useful to me.
There was a lot of noise.”
• Recall = Relevant Results / All Relevant Documents
• “Only 5 out of 10 useful documents in the index were
returned. There were lots of things missing.”
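The two definitions above can be sketched as set arithmetic. A minimal Python sketch, using hypothetical document IDs (not from the talk) to reproduce the "5 out of 10" scenario:

```python
# Precision and recall for a single query, computed from result sets.

def precision_recall(returned, relevant):
    """returned: docs the engine gave back; relevant: all useful docs in the index."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    return len(hits) / len(returned), len(hits) / len(relevant)

# 10 results returned, 5 of them useful; 10 useful docs exist in the index.
returned = [f"doc{i}" for i in range(10)]
relevant = [f"doc{i}" for i in range(5)] + [f"other{i}" for i in range(5)]
print(precision_recall(returned, relevant))  # (0.5, 0.5)
```

Raising recall (returning more iffy matches) tends to drag precision down, and vice versa, which is exactly the tradeoff the next slide's limerick is about.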
12. Precision and Recall
When relevance you want to tune,
All iffy results you should prune
To achieve good precision,
Unless your decision’s
That recall is more of a boon.
13. How do we tune relevance?
• Enrich documents with metadata that's useful to search
• Search the right fields
• Configure field analyzers to match the way users search
• Set field weights
• Match phrases
• Handle typos
• Apply synonyms and stemming
• Reward exact/complete matches
• Reward freshness, popularity, ratings, etc.
14. Scoring
• A search engine has to find relevant documents without
knowing what they mean
• A search engine assigns a numerical score to each match
using a "blind" but effective statistical heuristic
• Results are displayed in order by score
• To tune relevance we need to understand the search
engine’s built-in method of scoring
25. Query 4: “dog cat”
Doc 1: “dog dog dog dog dog dog dog”
Doc 2: “cat cat cat cat cat cat cat”
Doc 3: “dog cat”
26. Query 4: “dog cat”
Doc 1: “dog dog dog dog dog dog dog” → 0.8
Doc 2: “cat cat cat cat cat cat cat” → 0.8
Doc 3: “dog cat” → 1.2
Matching more query terms is good
Term Saturation
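Term saturation is visible directly in the BM25 term-frequency factor, tf / (tf + k1). A minimal sketch, assuming Elasticsearch's default k1 = 1.2 and ignoring length normalization for clarity:

```python
# BM25 term-frequency saturation: the tf contribution flattens as tf grows,
# so seven occurrences of "dog" do not score seven times higher than one.

def tf_factor(tf, k1=1.2):
    return tf / (tf + k1)

for tf in (1, 2, 7, 100):
    print(tf, round(tf_factor(tf), 3))
# tf=1   → 0.455
# tf=7   → 0.854  (7x the occurrences, nowhere near 7x the contribution)
# tf=100 → 0.988  (approaches 1, never exceeds it)
```

This is why Doc 3 above, matching both query terms once, beats documents that repeat a single term seven times.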
32. Ties
“when two documents have the same score, they will be sorted by their
internal Lucene doc id (which is unrelated to the _id) by default”
“The internal doc_id can differ for the same document inside each
replica of the same shard so it's recommended to use another
tiebreaker for sort in order to get consistent results. For instance you
could do: sort: ["_score", "datetime"] to force top_hits to
rank documents based on score first and use datetime as a
tiebreaker.”
"sort": [
{ "_score": { "order": "desc" }},
{ "date": { "order": "desc" }}
]
33. Comparing Field Scores
• Raw scores across fields are not directly comparable
• Term frequencies, document frequencies, and average field length all
differ across fields
• Field analyzers can generate additional tokens that affect scoring
• A "good" match in one field might score in the range 0.1 to 0.2 while a
good match in another field might score in the range 1 to 2. There’s
no universal relevance scale.
• A multiplicative boost of 10 doesn't mean “field1 is 10 times more
important than field2”
• Boosts can compensate for scale discrepancies
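The last two bullets reduce to simple arithmetic. Using the hypothetical score ranges from this slide (0.1–0.2 for one field, 1–2 for another; these numbers are illustrative, not measured):

```python
# A multiplicative boost rescales one field's scores into another's range.
# It compensates for scale, not importance: boost=10 here just makes a
# "good" match in either field contribute comparably.
field1_good_match = 0.15   # typical "good" score in field1
field2_good_match = 1.5    # typical "good" score in field2
boost = 10

print(field1_good_match * boost, field2_good_match)  # 1.5 1.5 — comparable
```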
34. TF x IDF
A search engine handles the chore
Of ranking each match, good or poor:
If a document’s TF
Divided by DF
Is huge, it will get the top score.
35. How does TFxIDF affect
query scoring?
• High score: A document with many occurrences of a rare
term
• Low score: A document with few occurrences of a common
term
• TFxIDF depends on the corpus
• A term stops being rare once more documents are added
that contain it
• Documents that don't match a query can still affect the
order of results
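The corpus dependence is easy to see in the IDF term itself. A sketch using the Lucene/Elasticsearch BM25 form of IDF, ln(1 + (N − n + 0.5) / (n + 0.5)), where N is the number of documents with the field and n the number containing the term (the counts here are hypothetical):

```python
import math

# Lucene/Elasticsearch BM25 idf: rarity of a term in the current corpus.
def idf(n, N):
    """n: docs containing the term; N: docs with the field."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# A term in 1 of 100 docs is rare and scores high...
print(round(idf(1, 100), 2))   # ≈ 4.21
# ...but after 49 more docs containing it are indexed, every existing
# match for that term scores lower, even though those docs never changed.
print(round(idf(50, 149), 2))  # ≈ 1.09
```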
46. Query 7: “dog”
Doc 1: “dog” → 0.13
Doc 2: “dog” → 0.13
Doc 3: “dog” → 0.13
We can do a Distributed Frequency Search
GET /test/_search?search_type=dfs_query_then_fetch
{ "query": { "match": { "title": "dog" } } }
47. Replicas And Scoring
• Replicas of the same shard may have different statistics
• Documents marked for deletion but not yet physically
removed (when their segments are merged) still
contribute to statistics
• Replicas may be out of sync re: physical deletion
• Specifying a user or session ID in the shard copy
preference parameter helps route requests to the
same replicas
48. Updates and Scoring
• Updates to an existing document behave like adding a
completely new document as far as DF statistics, until
segments are merged:
• “n, number of documents containing term” increases
• “N, total number of documents with field” increases
49. Updates and Scoring
PUT test/_doc/1
{ "title": "dog cat" }
GET test/_search?format=yaml
{ "query" : { "match" : { "title": "dog" } },
"explain": true }
PUT test/_doc/1?refresh
{ "title": "dog zebra" }
GET test/_search?format=yaml
{ "query" : { "match" : { "title": "dog" } },
"explain": true }
POST test/_forcemerge
GET test/_search?format=yaml
{ "query" : { "match" : { "title": "dog" } },
"explain": true }
_score: 0.2876821
"n, number of documents containing term”: 1
"N, total number of documents with field”: 1
_score: 0.2876821
"n, number of documents containing term”: 1
"N, total number of documents with field”: 1
_score: 0.18232156
"n, number of documents containing term”: 2
"N, total number of documents with field”: 2
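The scores in the explain output above follow directly from BM25. A minimal sketch reproducing them, assuming Elasticsearch's defaults (k1 = 1.2, b = 0.75); both titles have two terms, so dl equals avgdl and the length norm cancels:

```python
import math

# Full single-term BM25 score: idf * saturated-tf with length normalization.
def bm25(tf, dl, avgdl, n, N, k1=1.2, b=0.75):
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return idf * norm

# Before the merge: the old version of doc 1 still counts, but "dog" and
# the field each appear in 1 *live* stats entry → n=1, N=1.
print(bm25(tf=1, dl=2, avgdl=2, n=1, N=1))  # ≈ 0.2876821
# After _forcemerge purges nothing extra here, the update itself left
# n=2, N=2 until segments merged; with n=2, N=2 the score drops.
print(bm25(tf=1, dl=2, avgdl=2, n=2, N=2))  # ≈ 0.1823216
```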
50. Query 4: “dog cat”
Doc 1: “dog dog dog dog dog dog dog” → 0.8
Doc 2: “cat cat cat cat cat cat cat” → 0.8
Doc 3: “dog cat” → 1.2
Matching more query terms is good.
But what also benefits Doc 3 here?
51. Query 4 redux: “dog cat”
Doc 1: “dog dog” → 0.6
Doc 2: “cat cat” → 0.6
Doc 3: “dog cat” → 0.9
Matching more query terms is good
53. Query 4 redux: “dog cat”
Doc 1: {“pet1”: “dog”, “pet2”: “dog”} → 0.87
Doc 2: {“pet1”: ”dog”, “pet2”: “cat”} → 0.87
Matching more query terms within the
same field is good. But there's no
advantage when the matches happen
across fields.
54. Query 4 redux: “dog cat”
Doc 1: {“pet1”:“dog”, “pet2”: “dog”} → 0.18
Doc 2: {“pet1”:”dog”, “pet2”: “cat”} → 0.87
We can simulate a single field using
cross_fields.
GET test/_search
{
"query": {
"multi_match": {
"query": "dog cat",
"fields": ["pet1", "pet2"],
"type": "cross_fields"
}
}
}
55. Query 8: “orange dog”
Doc 1: {“type”: “dog”, “description”: “A sweet and loving pet that is always
eager to play. Brown coat. Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Duis non nibh sagittis, mollis ex a, scelerisque nisl. Ut vitae
pellentesque magna, ut tristique nisi. Maecenas ut urna a elit posuere
scelerisque. Suspendisse vel urna turpis. Mauris viverra fermentum
ullamcorper. Duis ac lacus nibh. Nulla auctor lacus in purus vulputate,
maximus ultricies augue scelerisque.”}
Doc 2: {“type”: “cat”, “description”: “Puzzlingly grumpy. Occasionally turns
orange.”}
GET test/_search
{
"query": {
"multi_match": {
"query": "orange dog",
"fields": ["type", "description"],
"type": "most_fields"
}
}
}
56. Query 8: “orange dog”
Doc 1: {“type”: “dog”, “description”: “A sweet and loving pet that is
always eager to play. Brown coat. Lorem ipsum dolor sit amet,
consectetur adipiscing elit. Duis non nibh sagittis, mollis ex a, scelerisque
nisl. Ut vitae pellentesque magna, ut tristique nisi. Maecenas ut urna a elit
posuere scelerisque. Suspendisse vel urna turpis. Mauris viverra
fermentum ullamcorper. Duis ac lacus nibh. Nulla auctor lacus in purus
vulputate, maximus ultricies augue scelerisque.”} → 1.06
Doc 2: {“type”: “cat”, “description”: “Puzzlingly grumpy. Occasionally
turns orange.”} → 0.69
“Shortness” is relative to the field's average
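"Relative to the average" is the b-weighted part of the BM25 length norm, 1 − b + b · dl/avgdl. A sketch with Elasticsearch's defaults (k1 = 1.2, b = 0.75) and hypothetical field lengths:

```python
# BM25 tf weight with length normalization: documents shorter than the
# field's *average* length are boosted; longer ones are dampened.
def tf_weight(tf, dl, avgdl, k1=1.2, b=0.75):
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))

# Same term frequency, same field averaging 40 terms:
print(round(tf_weight(1, dl=6, avgdl=40), 3))   # short vs. average → boosted
print(round(tf_weight(1, dl=40, avgdl=40), 3))  # exactly average → 1.0
print(round(tf_weight(1, dl=60, avgdl=40), 3))  # long vs. average → dampened
```

So the six-word cat description wins not because it is short in absolute terms, but because it is short compared to the description field's average, which the lorem-ipsum dog entry inflates.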
59. Case Study: SKUs
I searched for a product by SKU—
I was looking to purchase a shoe—
But the website I used
Seemed very confused
And offered me nothing to view.
123AB-543D-234C
79. Fuzziness and Scoring
PUT /test/_doc/1
{ "title": "dog" }
PUT /test/_doc/2
{ "title": "elephant" }
GET /test/_validate/query?rewrite=true
{
"query": {
"match" : {
"title": {
"query": "dog",
"fuzziness": 2
}}}}
GET /test/_search
{
"query": {
"fuzzy" : {
"title": {
"value": "dog",
"fuzziness": 2,
"rewrite": "constant_score"
}
}
}
}
Query Lucene Query Edits
dog title:dog
dg (title:dog)^0.5 D
do (title:dog)^0.5 D
dgo (title:dog)^0.666666 T
dox (title:dog)^0.666666 S
dogg (title:dog)^0.666666 I
doggg (title:dog)^0.333333 I, I
elepha (title:elephant)^0.6666666 D, D
elephan (title:elephant)^0.85714287 D
elephantt (title:elephant)^0.875 I
elephanttt (title:elephant)^0.75 I, I
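The boosts in the table above follow a simple pattern: boost = 1 − edits / min(len(query), len(term)), where edits is the Damerau-Levenshtein distance (a transposition counts as one edit). A sketch that reproduces the table's numbers:

```python
# Damerau-Levenshtein distance (optimal string alignment variant):
# insert, delete, substitute, and adjacent transposition each cost 1.
def damerau_levenshtein(a, b):
    d = {}
    for i in range(len(a) + 1):
        d[i, 0] = i
    for j in range(len(b) + 1):
        d[0, j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i, j] = min(d[i, j], d[i - 2, j - 2] + 1)  # transposition
    return d[len(a), len(b)]

def fuzzy_boost(query, term):
    edits = damerau_levenshtein(query, term)
    return 1 - edits / min(len(query), len(term))

print(fuzzy_boost("dgo", "dog"))            # 0.666... (one transposition)
print(fuzzy_boost("elephantt", "elephant")) # 0.875    (one insertion)
```

This is why fuzzier matches still rank below exact ones: each edit shaves the match's score down rather than disqualifying it.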
86. Case Study: SKUs
I searched for a product by SKU—
I was looking to purchase a shoe.
Results were returned
And the price I soon learned.
“I’ll take one,” I said, “Make it two!”
123AB-543D-234C