This document discusses various techniques used in web search engines for indexing and ranking documents. It covers topics like inverted indices, stopword removal, stemming, relevance feedback, vector space models, and Bayesian inference networks. Web search engines prepare an index of keywords for documents and return ranked lists in response to queries by measuring similarities between query and document vectors based on term frequencies and inverse document frequencies.
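To make the vector-space idea above concrete, here is a minimal, illustrative sketch (the toy corpus and the exact TF-IDF smoothing are assumptions, not taken from any of the documents summarized here) of scoring documents against a query with TF-IDF weights and cosine similarity:

```python
import math
from collections import Counter

docs = [
    "web search engines build an inverted index of keywords",
    "stemming and stopword removal normalize document terms",
    "ranked retrieval compares query and document vectors",
]

def tokenize(text):
    return text.lower().split()

# Document frequency for each term across the (tiny, assumed) corpus.
df = Counter(term for doc in docs for term in set(tokenize(doc)))
N = len(docs)

def tfidf_vector(text):
    tf = Counter(tokenize(text))
    # Smoothed IDF; real systems vary in the exact weighting formula.
    return {t: tf[t] * math.log((N + 1) / (df.get(t, 0) + 1)) for t in tf}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

query = "ranked keyword search"
scores = sorted(((cosine(tfidf_vector(query), tfidf_vector(d)), d) for d in docs), reverse=True)
for score, doc in scores:
    print(f"{score:.3f}  {doc}")
```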
This document provides an overview of latest trends in AI and information retrieval. It discusses how search engines work by crawling websites, indexing content, handling user queries, and ranking results. Open-source search solutions and real-world problems in information retrieval are also covered, such as extracting text from web pages and using machine learning for ranking. Emerging areas like learning to rank, query expansion, question answering, and neural information retrieval methods are also summarized. The document concludes by listing some common job roles in the software industry.
This document discusses information retrieval techniques. It begins by defining information retrieval as selecting the most relevant documents from a large collection based on a query. It then discusses some key aspects of information retrieval including document representation, indexing, query representation, and ranking models. The document also covers specific techniques used in information retrieval systems like parsing documents, tokenization, removing stop words, normalization, stemming, and lemmatization.
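As a rough sketch of the preprocessing steps just listed (tokenization, stop word removal, and stemming), the following uses a tiny made-up stopword list and a deliberately naive suffix stripper; production systems rely on resources such as the Porter stemmer or a dictionary-based lemmatizer:

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}  # tiny illustrative list

def tokenize(text):
    # Lowercase and split on non-alphanumeric characters.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def naive_stem(token):
    # Extremely simplified suffix stripping; real systems use Porter/Snowball
    # stemmers or lemmatization instead of this toy rule set.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

raw = "The engines are indexing and ranking the retrieved documents"
tokens = [naive_stem(t) for t in remove_stopwords(tokenize(raw))]
print(tokens)  # ['engin', 'are', 'index', 'rank', 'retriev', 'document']
```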
The document is a presentation on information retrieval by Richard Chbeir. It discusses key concepts in information retrieval including definitions of information retrieval, the information retrieval process, query and document processing techniques like stop word removal and stemming, representation models like the Boolean and vector space models, and inverted indexes. Specific topics covered include query representation, document indexing and processing, weighting schemes for terms, and measuring similarity between queries and documents.
The document discusses the basics of information retrieval systems. It covers two main stages - indexing and retrieval. In the indexing stage, documents are preprocessed and stored in an index. In retrieval, queries are issued and the index is accessed to find relevant documents. The document then discusses several models for defining relevance between documents and queries, including the Boolean model and vector space model. It also covers techniques for representing documents and queries as vectors and calculating similarity between them.
This 2-hour lecture was held at Amsterdam University of Applied Sciences (HvA) on October 16th, 2013. It provides a basic overview of core technologies used by ICT companies such as Google, Twitter, and Facebook. The lecture does not require a strong technical background and stays at a conceptual level.
Information retrieval (IR) is the process of searching for and retrieving relevant documents from a large collection based on a user's query. Key aspects of IR include:
- Representing documents and queries in a way that allows measuring their similarity, such as the vector space model.
- Ranking retrieved documents by relevance to the query using factors like term frequency and inverse document frequency.
- Allowing for similarity-based retrieval where documents similar to a given document are retrieved.
This document discusses various techniques for information retrieval (IR), including global and local methods. Global methods reformulate queries while local methods are relative to initial search results. Local methods discussed include relevance feedback, probabilistic relevance feedback, and indirect feedback. The Rocchio algorithm incorporates relevance feedback into the vector space model using cosine similarity. Naive Bayes classification and support vector machines are also covered as techniques for text classification.
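A minimal sketch of the Rocchio update described above, assuming conventional alpha/beta/gamma defaults; the weights and toy vectors are illustrative, not taken from the document:

```python
from collections import defaultdict

def rocchio(query_vec, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance-feedback update: move the query vector toward the centroid
    of relevant documents and away from the centroid of non-relevant documents."""
    new_q = defaultdict(float)
    for term, w in query_vec.items():
        new_q[term] += alpha * w
    for doc in relevant_docs:
        for term, w in doc.items():
            new_q[term] += beta * w / len(relevant_docs)
    for doc in nonrelevant_docs:
        for term, w in doc.items():
            new_q[term] -= gamma * w / len(nonrelevant_docs)
    # Negative weights are usually clipped to zero.
    return {t: w for t, w in new_q.items() if w > 0}

q = {"jaguar": 1.0}
relevant = [{"jaguar": 0.8, "car": 0.9}, {"jaguar": 0.6, "engine": 0.7}]
nonrelevant = [{"jaguar": 0.5, "cat": 0.9}]
print(rocchio(q, relevant, nonrelevant))
```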
Information extraction involves extracting structured information from unstructured text. The goal is to identify named entities, relations between entities, and populate a database. This may also include event extraction, resolving temporal expressions, and wrapper induction. Common tasks include named entity recognition, co-reference resolution, relation extraction, and event extraction. Statistical methods like conditional random fields are often used. Evaluation involves measuring precision and recall.
The diversity and complexity of content available on the web have increased dramatically in recent years. Multimedia content such as images, videos, maps, and voice recordings is published more often than before. Document genres have also diversified, for instance news, blogs, FAQs, and wikis. These diversified information sources are often dealt with separately. For example, in web search, users have to switch between search verticals to access different sources. Recently, there has been growing interest in finding effective ways to aggregate these information sources so as to hide the complexity of the information spaces from users searching for relevant information. For example, so-called aggregated search, investigated by the major search engine companies, provides search results from several sources in a single result page. Aggregation itself is not a new paradigm; for instance, aggregate operators are common in database technology.
This talk presents the challenges faced by the likes of web search engines and digital libraries in providing the means to aggregate information from several complex information spaces in a way that helps users in their information-seeking tasks. It also discusses how other disciplines, including databases, artificial intelligence, and cognitive science, can be brought into building effective and efficient aggregated search systems.
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S... (Lucidworks)
The document discusses implementing conceptual search in Solr. It describes how conceptual search aims to improve recall without reducing precision by matching documents based on concepts rather than keywords alone. It explains how Word2Vec can be used to learn related concepts from documents and represent words as vectors, which can then be embedded in Solr through synonym filters and payloads to enable conceptual search queries. This allows retrieving more relevant documents that do not contain the exact search terms but are still conceptually related.
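The following is a rough sketch of the Word2Vec step only, assuming the gensim library; the toy corpus and parameters are placeholders, and the wiring of the learned terms into Solr synonym filters and payloads is not shown:

```python
from gensim.models import Word2Vec

# Tokenized documents (in practice, the analyzed document text from the index).
sentences = [
    ["java", "developer", "hadoop", "spark"],
    ["registered", "nurse", "hospital", "patient", "care"],
    ["java", "software", "engineer", "spark", "hadoop"],
    ["nurse", "clinical", "patient", "care"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200, seed=42)

# Terms whose learned vectors are closest to "java" can be emitted as weighted
# "conceptual synonyms" for query-time expansion.
for term, similarity in model.wv.most_similar("java", topn=3):
    print(f"{term}\t{similarity:.2f}")
```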
Crowdsourced query augmentation through the semantic discovery of domain spec... (Trey Grainger)
Talk Abstract: Most work in semantic search has thus far focused upon either manually building language-specific taxonomies/ontologies or upon automatic techniques such as clustering or dimensionality reduction to discover latent semantic links within the content that is being searched. The former is very labor-intensive and is hard to maintain, while the latter is prone to noise and may be hard for a human to understand or to interact with directly. We believe that the links between similar users' queries represent a largely untapped source for discovering latent semantic relationships between search terms. The proposed system is capable of mining user search logs to discover semantic relationships between key phrases in a manner that is language agnostic, human understandable, and virtually noise-free.
Extending Solr: Building a Cloud-like Knowledge Discovery Platform (Trey Grainger)
Trey Grainger discusses CareerBuilder's large-scale search platform built on Apache Solr. The platform handles over 150 search servers and indexes over 100 million documents in multiple languages and fields. Grainger describes CareerBuilder's approaches to multi-lingual analysis, custom scoring, and implementing a "Solr cloud" to make search capabilities easily accessible. He also discusses how the search platform is used for knowledge discovery and data analytics applications beyond just search.
The document discusses various techniques for dimensionality reduction and analysis of text data, including latent semantic indexing (LSI), locality preserving indexing (LPI), and probabilistic latent semantic analysis (PLSA). LSI uses singular value decomposition to project documents into a lower-dimensional space while minimizing reconstruction error. LPI aims to preserve local neighborhood structures between similar documents. PLSA models documents as mixtures of underlying latent themes characterized by multinomial word distributions.
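A small LSI-style sketch using NumPy's SVD on a toy document-term matrix; the matrix and the choice of k = 2 latent dimensions are assumptions for illustration:

```python
import numpy as np

# Rows = documents, columns = terms (term-frequency counts for a toy corpus).
terms = ["search", "index", "query", "cell", "gene", "protein"]
A = np.array([
    [3, 2, 4, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 4, 3, 2],
    [0, 1, 0, 3, 4, 3],
], dtype=float)

# Truncated SVD: keep only the k largest singular values/vectors.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = U[:, :k] * s[:k]      # documents in the k-dimensional latent space
term_vectors = Vt[:k, :].T * s[:k]  # terms projected into the same latent space

# Cosine similarity between documents in the reduced space.
def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print("doc0 vs doc1:", round(cos(doc_vectors[0], doc_vectors[1]), 3))
print("doc0 vs doc2:", round(cos(doc_vectors[0], doc_vectors[2]), 3))
```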
Enhancing relevancy through personalization & semantic search (Trey Grainger)
Matching keywords is just step one in the effort to maximize the relevancy of your search platform. In this talk, you'll learn how to implement advanced relevancy techniques which enable your search platform to "learn" from your content and users' behavior. Topics will include automatic synonym discovery, latent semantic indexing, payload scoring, document-to-document searching, foreground vs. background corpus analysis for interesting term extraction, collaborative filtering, and mining user behavior to drive geographically and conceptually personalized search results. You'll learn how CareerBuilder has enhanced Solr (also utilizing Hadoop) to dynamically discover relationships between data and behavior, and how you can implement similar techniques to greatly enhance the relevancy of your search platform.
Intent Algorithms: The Data Science of Smart Information Retrieval Systems (Trey Grainger)
Search engines, recommendation systems, advertising networks, and even data analytics tools all share the same end goal - to deliver the most relevant information possible to meet a given information need (usually in real-time). Perfecting these systems requires algorithms which can build a deep understanding of the domains represented by the underlying data, understand the nuanced ways in which words and phrases should be parsed and interpreted within different contexts, score the relationships between arbitrary phrases and concepts, continually learn from users' context and interactions to make the system smarter, and generate custom models of personalized tastes for each user of the system.
In this talk, we'll dive into both the philosophical questions associated with such systems ("how do you accurately represent and interpret the meaning of words?", "How do you prevent filter bubbles?", etc.), as well as look at practical examples of how these systems have been successfully implemented in production systems combining a variety of available commercial and open source components (inverted indexes, entity extraction, similarity scoring and machine-learned ranking, auto-generated knowledge graphs, phrase interpretation and concept expansion, etc.).
The Intent Algorithms of Search & Recommendation Engines (Trey Grainger)
Trey Grainger gave a guest lecture on the intent algorithms of search and recommendation engines. He discussed how search engines work from basic keyword search to more advanced semantic search that incorporates user intent, personalization, and augmented intelligence. Grainger also covered how Lucidworks' products like Apache Solr and Fusion power search for many large companies through highly scalable and customizable search platforms.
Text mining seeks to extract useful information from unstructured text documents. It involves preprocessing the text, identifying features, and applying techniques from data mining, machine learning and natural language processing to discover patterns. The core operations of text mining include analyzing distributions of concepts, identifying frequent concept sets and associations between concepts. Text mining systems aim to analyze document collections over time to identify trends, ephemeral relationships and anomalous patterns.
Tovek Tools provides software for discovering information hidden in textual data. It was founded in 1993 in the Czech Republic to help users find, understand, and utilize information through advanced search and analysis tools. Tovek Tools includes desktop applications like Tovek Agent for querying indexes and viewing results as well as a server product for automatically indexing and profiling content from various sources in real-time. The goal is to help analysts reduce the time spent searching, analyzing, and disseminating information from both structured and unstructured data sources.
Reflected intelligence: evolving self-learning data systems (Trey Grainger)
In this presentation, we’ll talk about evolving self-learning search and recommendation systems which are able to accept user queries, deliver relevance-ranked results, and iteratively learn from the users’ subsequent interactions to continually deliver a more relevant experience. Such a self-learning system leverages reflected intelligence to consistently improve its understanding of the content (documents and queries), the context of specific users, and the collective feedback from all prior user interactions with the system. Through iterative feedback loops, such a system can leverage user interactions to learn the meaning of important phrases and topics within a domain, identify alternate spellings and disambiguate multiple meanings of those phrases, learn the conceptual relationships between phrases, and even learn the relative importance of features to automatically optimize its own ranking algorithms on a per-query, per-category, or per-user/group basis.
Search engines, and Apache Solr in particular, are quickly shifting the focus away from “big data” systems storing massive amounts of raw (but largely unharnessed) content, to “smart data” systems where the most relevant and actionable content is quickly surfaced instead. Apache Solr is the blazing-fast and fault-tolerant distributed search engine leveraged by 90% of Fortune 500 companies. As a community-driven open source project, Solr brings in diverse contributions from many of the top companies in the world, particularly those for whom returning the most relevant results is mission critical.
Out of the box, Solr includes advanced capabilities like learning to rank (machine-learned ranking), graph queries and distributed graph traversals, job scheduling for processing batch and streaming data workloads, the ability to build and deploy machine learning models, and a wide variety of query parsers and functions allowing you to very easily build highly relevant and domain-specific semantic search, recommendations, or personalized search experiences. These days, Solr even enables you to run SQL queries directly against it, mixing and matching the full power of Solr’s free-text, geospatial, and other search capabilities with a prominent query language already known by most developers (and which many external systems can use to query Solr directly).
Due to the community-oriented nature of Solr, the ecosystem of capabilities also spans well beyond just the core project. In this talk, we’ll also cover several other projects within the larger Apache Lucene/Solr ecosystem that further enhance Solr’s smart data capabilities: bi-directional integration of Apache Spark and Solr’s capabilities, large-scale entity extraction, semantic knowledge graphs for discovering, traversing, and scoring meaningful relationships within your data, auto-generation of domain-specific ontologies, running SPARQL queries against Solr on RDF triples, probabilistic identification of key phrases within a query or document, conceptual search leveraging Word2Vec, and even Lucidworks’ own Fusion project which extends Solr to provide an enterprise-ready smart data platform out of the box.
We’ll dive into how all of these capabilities can fit within your data science toolbox, and you’ll come away with a really good feel for how to build highly relevant “smart data” applications leveraging these key technologies.
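As a minimal, hedged illustration of issuing a query against Solr's standard /select handler over HTTP (the host, collection name, and field names below are made up; this is not code from the talk):

```python
import requests

# Hypothetical local Solr instance; the collection "jobs" and the fields
# "title" and "skills" are assumptions for illustration only.
SOLR_URL = "http://localhost:8983/solr/jobs/select"

params = {
    "q": 'title:"software engineer"',   # free-text / fielded query
    "fq": "skills:hadoop",              # filter query, applied without affecting scoring
    "fl": "id,title,score",             # fields to return
    "rows": 10,
    "wt": "json",
}

response = requests.get(SOLR_URL, params=params, timeout=10)
response.raise_for_status()
for doc in response.json()["response"]["docs"]:
    print(doc["score"], doc.get("title"))
```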
This document provides an introduction to text mining, including defining key concepts such as structured vs. unstructured data, why text mining is useful, and some common challenges. It also outlines important text mining techniques like pre-processing text through normalization, tokenization, stemming, and removing stop words to prepare text for analysis. Text mining methods can be used for applications such as sentiment analysis, predicting markets or customer churn.
Visualization approaches in text mining emphasize making large amounts of data easily accessible and identifying patterns within the data. Common visualization tools include simple concept graphs, histograms, line graphs, and circle graphs. These tools allow users to quickly explore relationships within text data and gain insights that may not be apparent from raw text alone. Architecturally, visualization tools are layered on top of text mining systems' core algorithms and allow for modular integration of different visualization front ends.
The document provides an overview of text mining, including:
1. Text mining analyzes unstructured text data through techniques like information extraction, text categorization, clustering, and summarization.
2. It differs from regular data mining as it works with natural language text rather than structured databases.
3. Text mining has various applications including security, biomedicine, software, media, business and more. It faces challenges in representing meaning and context from unstructured text.
Latent semantic analysis (LSA) is a technique used in natural language processing to analyze relationships between documents and terms by producing concepts related to them. LSA assumes words with similar meanings will occur in similar texts, and uses a documents-terms matrix and singular value decomposition to discover hidden concepts and represent words and documents as vectors in a semantic vector space. Apache OpenNLP is a machine learning toolkit that can be used for various natural language processing tasks like part-of-speech tagging and parsing, and LSA can be seen as part of natural language processing.
Concepts and Challenges of Text Retrieval for Search Engine (Gan Keng Hoon)
This document discusses concepts and challenges in text retrieval for search engines. It provides an overview of text retrieval and search engine concepts. Some key challenges discussed are semantics and specificity in queries. The document also uses an example of an expert search engine to illustrate a case study. It describes various components involved in text retrieval including document representation, indexing, inverted indexing, retrieval functions and evaluation metrics.
Web search engines index documents and respond to keyword queries by returning a ranked list of relevant documents. Early search engines like Archie allowed searching by title across FTP sites. Modern search engines preprocess documents by removing tags and stopwords, stemming words, and building inverted indexes to map terms to documents for fast retrieval. They evaluate search results using metrics like precision and recall compared to human judgments of relevance.
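A compact sketch of the preprocessing-plus-inverted-index step described above; the stopword list and documents are made up:

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "and", "of", "to"}

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog and the fox",
    3: "a quick tour of the index",
}

# term -> set of document IDs containing the term
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        if term not in STOPWORDS:
            inverted_index[term].add(doc_id)

index = {term: sorted(ids) for term, ids in inverted_index.items()}
print(index["fox"])    # [1, 2]
print(index["quick"])  # [1, 3]
```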
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb... (Trey Grainger)
Search engines frequently miss the mark when it comes to understanding user intent. This talk will walk through some of the key building blocks necessary to turn a search engine into a dynamically-learning "intent engine", able to interpret and search on meaning, not just keywords. We will walk through CareerBuilder's semantic search architecture, including semantic autocomplete, query and document interpretation, probabilistic query parsing, automatic taxonomy discovery, keyword disambiguation, and personalization based upon user context/behavior. We will also see how to leverage an inverted index (Lucene/Solr) as a knowledge graph that can be used as a dynamic ontology to extract phrases, understand and weight the semantic relationships between those phrases and known entities, and expand the query to include those additional conceptual relationships.
As an example, most search engines completely miss the mark at parsing a query like (Senior Java Developer Portland, OR Hadoop). We will show how to dynamically understand that "senior" designates an experience level, that "java developer" is a job title related to "software engineering", that "portland, or" is a city with a specific geographical boundary (as opposed to a keyword followed by a boolean operator), and that "hadoop" is the skill "Apache Hadoop", which is also related to other terms like "hbase", "hive", and "map/reduce". We will discuss how to train the search engine to parse the query into this intended understanding and how to reflect this understanding to the end user to provide an insightful, augmented search experience.
Topics: Semantic Search, Apache Solr, Finite State Transducers, Probabilistic Query Parsing, Bayes Theorem, Augmented Search, Recommendations, Query Disambiguation, NLP, Knowledge Graphs
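As a toy illustration of the query-tagging idea from the example above, the sketch below uses a hand-built entity dictionary; the real system described in the talk learns these relationships from its index and knowledge graph rather than from a static lookup table:

```python
# The entity lists below are made up for illustration only.
KNOWN_ENTITIES = {
    "senior": ("experience_level", "senior"),
    "java developer": ("job_title", "Java Developer"),
    "portland, or": ("location", "Portland, OR"),
    "hadoop": ("skill", "Apache Hadoop"),
}

def tag_query(query, max_phrase_len=3):
    tokens = query.lower().split()
    tags, i = [], 0
    while i < len(tokens):
        # Greedily try the longest known phrase starting at position i.
        for n in range(min(max_phrase_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in KNOWN_ENTITIES:
                tags.append(KNOWN_ENTITIES[phrase])
                i += n
                break
        else:
            tags.append(("keyword", tokens[i]))
            i += 1
    return tags

print(tag_query("Senior Java Developer Portland, OR Hadoop"))
# [('experience_level', 'senior'), ('job_title', 'Java Developer'),
#  ('location', 'Portland, OR'), ('skill', 'Apache Hadoop')]
```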
The document discusses text mining and provides examples. It defines text mining as the extraction of implicit knowledge from large amounts of textual data. It discusses applications such as marketing, industry research, and job seeking. Key text mining methods covered include information retrieval, information extraction, web mining, and clustering. The document outlines the text mining process and discusses text characteristics, learning methods such as classification and clustering, and evaluation metrics. Examples are provided to illustrate classification using decision trees and k-nearest neighbors on structured and unstructured text data.
Web search engines index documents by creating an inverted index of keywords to map keyword queries to relevant documents. Early search engines like Archie provided title searches across FTP sites. Modern search engines implement techniques like Boolean, proximity and phrase queries. Documents are preprocessed by removing tags and stemming words before being stored in an inverted index. Ranking search results involves estimating the relevance of documents to queries using models like vector space and probabilistic models like Bayesian inference networks. Relevance feedback allows refining queries based on user selections from initial search results. Evaluation measures recall and precision by comparing system search results to manually identified ground truths.
This document provides an introduction to text mining and information retrieval. It discusses how text mining is used to extract knowledge and patterns from unstructured text sources. The key steps of text mining include preprocessing text, applying techniques like summarization and classification, and analyzing the results. Text databases and information retrieval systems are described. Various models and techniques for text retrieval are outlined, including Boolean, vector space, and probabilistic models. Evaluation measures like precision and recall are also introduced.
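Precision and recall, mentioned in several of these summaries, can be computed per query from a result list and a set of human relevance judgments; a minimal sketch with hypothetical document IDs:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall against human relevance judgments."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list and ground-truth judgments for one query.
retrieved = ["d3", "d7", "d1", "d9"]
relevant = ["d1", "d2", "d3"]

p, r = precision_recall(retrieved, relevant)
f1 = 2 * p * r / (p + r) if (p + r) else 0.0
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")  # precision=0.50 recall=0.67 F1=0.57
```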
There are many examples of text-based documents, all in electronic format: e-mails, corporate Web pages, customer surveys, résumés, medical records, DNA sequences, technical papers, incident reports, news stories, and more. There is rarely enough time or patience to read them all, and some (e.g. DNA sequences) are hard to comprehend directly. Can we extract the most vital kernels of information? In other words, we wish to find a way to gain knowledge, in summarised form, from all that text without reading or examining it fully first.
Efficient Query Processing in Geographic Web Search Engines (Yen-Yu Chen)
Geographic web search engines allow users to constrain and order search results in an intuitive manner by focusing a query on a particular geographic region. Geographic search technology, also called local search, has recently received significant interest from major search engine companies. Academic research in this area has focused primarily on techniques for extracting geographic knowledge from the web. In this paper, we study the problem of efficient query processing in scalable geographic search engines. Query processing is a major bottleneck in standard web search engines, and the main reason for the thousands of machines used by the major engines. Geographic search engine query processing is different in that it requires a combination of text and spatial data processing techniques. We propose several algorithms for efficient query processing in geographic search engines, integrate them into an existing web search query processor, and evaluate them on large sets of real data and query traces.
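A deliberately simplified sketch of combining text and spatial filtering (not the algorithms proposed in the paper): intersect a term's posting list with a bounding-box check on per-document coordinates:

```python
# Toy index: term -> doc IDs, and doc ID -> (lat, lon); both are made up.
postings = {"pizza": [1, 2, 3, 4]}
locations = {
    1: (45.52, -122.68),   # Portland
    2: (40.71, -74.01),    # New York
    3: (45.50, -122.65),   # Portland
    4: (37.77, -122.42),   # San Francisco
}

def geo_search(term, lat_range, lon_range):
    (lat_min, lat_max), (lon_min, lon_max) = lat_range, lon_range
    results = []
    for doc_id in postings.get(term, []):
        lat, lon = locations[doc_id]
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            results.append(doc_id)
    return results

# Roughly the Portland, OR area.
print(geo_search("pizza", (45.4, 45.7), (-122.9, -122.5)))  # [1, 3]
```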
Set Similarity Search using a Distributed Prefix Tree Index (HPCC Systems)
The document describes an approach for set similarity search using a distributed prefix tree index. It begins by introducing the problem of set similarity search and examples of similarity functions like Jaccard similarity. It then reviews existing approaches like inverted indexes and introduces a new approach using a prefix tree to index the record sets. The remainder of the document discusses implementing and testing the prefix tree approach on various datasets and analyzing the results. It finds that the token order in the prefix tree impacts performance and that adding the level as an additional index key improves query runtime. The prefix tree approach generally outperforms inverted indexes at high similarity thresholds.
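A single-machine sketch of the underlying ideas, Jaccard similarity plus prefix filtering for candidate generation (not the distributed prefix-tree implementation evaluated in the document):

```python
import math

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def prefix(tokens, threshold):
    """Prefix filtering: two sets can only reach the Jaccard threshold if their
    sorted prefixes of length |s| - ceil(threshold * |s|) + 1 share a token."""
    s = sorted(tokens)  # a global token order; frequency-based orders work better
    keep = len(s) - math.ceil(threshold * len(s)) + 1
    return set(s[:max(keep, 1)])

records = {
    "r1": ["apache", "solr", "search", "engine"],
    "r2": ["apache", "lucene", "search", "library"],
    "r3": ["gene", "protein", "cell", "dna"],
}
query = ["apache", "solr", "search"]
threshold = 0.5

q_prefix = prefix(query, threshold)
# Candidate generation: only records whose prefix overlaps the query prefix.
candidates = [rid for rid, toks in records.items() if prefix(toks, threshold) & q_prefix]
# Verification: compute exact Jaccard similarity on the candidates only.
matches = [(rid, jaccard(query, records[rid])) for rid in candidates]
print([(rid, round(sim, 2)) for rid, sim in matches if sim >= threshold])  # [('r1', 0.75)]
```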
This document summarizes key concepts in information retrieval systems and algorithms for large data sets. It discusses the differences between information retrieval and data retrieval systems. It also describes several classic models for relevance ranking in IR, including the Boolean model and vector space model. The document outlines topics like text processing, indexing, searching, and evaluation in information retrieval systems.
The document discusses two main types of retrieval models: Boolean models which use set theory and vector space models which use statistical and algebraic approaches. Vector space models represent documents and queries as vectors of keywords weighted by factors like term frequency and inverse document frequency. Similarity between document and query vectors is calculated using measures like the inner product or cosine similarity to retrieve and rank documents.
This document discusses vector space retrieval models. It describes how documents and queries are represented as vectors in a common vector space based on terms. Terms are weighted using metrics like term frequency (TF) and inverse document frequency (IDF) to determine importance. The cosine similarity measure is used to calculate similarity between document and query vectors and rank results by relevance. While simple and effective in practice, vector space models have limitations like missing semantic and syntactic information.
The document describes the process of building an inverted index for information retrieval. Key points:
- Documents are parsed to extract terms which are sorted in a vocabulary file along with document frequency and collection frequency.
- A postings file stores the document IDs and term frequencies for each unique term. This separates the small vocabulary file for fast searching from the large postings file.
- The process involves tokenizing documents, removing stopwords, stemming terms, and counting term frequencies to build the inverted index files for efficient searching of documents based on terms.
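A small in-memory sketch of that vocabulary/postings separation, with dictionaries standing in for the on-disk files (the documents are made up):

```python
from collections import Counter, defaultdict

docs = {
    10: "information retrieval with inverted files",
    11: "inverted index files map terms to postings",
    12: "retrieval models rank documents",
}

postings = defaultdict(list)   # term -> [(doc_id, term_frequency), ...]
for doc_id, text in docs.items():
    counts = Counter(text.lower().split())
    for term, tf in counts.items():
        postings[term].append((doc_id, tf))

# Vocabulary "file": per-term document frequency (df) and collection frequency (cf).
vocabulary = {
    term: {"df": len(plist), "cf": sum(tf for _, tf in plist)}
    for term, plist in postings.items()
}

print(vocabulary["inverted"])   # {'df': 2, 'cf': 2}
print(postings["retrieval"])    # [(10, 1), (12, 1)]
```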
The EXTRA classifier is a scalable solution based on recent advances in Natural Language Processing (NLP). The foundational concept of the EXTRA classifier is transfer learning, a machine learning process that enables the relatively low-cost specialization of a pre-trained language model to a specific task in a specific domain with far fewer training examples compared to standard machine learning solutions.
More specifically, the EXTRA classifier leverages BERT, a well-known pre-trained autoencoding language model that has revolutionized the NLP space in the past few years. BERT provides contextual embeddings, i.e., it provides context-aware vector representations of words that capture semantics far more efficiently than their context-free counterparts.
The EXTRA classifier contains a pre-processing module to cope with the inevitable noise in the output of standard Optical Character Recognition systems. The pre-processed plain text from a source document is then fed into a BERT-based classifier, which is built by extending pre-trained BERT with an additional linear layer trained for classification through a process commonly known as fine-tuning.
We will present preliminary results that confirm some clear benefits with respect to rule-based solutions in terms of classification performance and system scalability.
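A rough sketch of the fine-tuning recipe described above, assuming the Hugging Face transformers and PyTorch libraries; the label set, example texts, and hyperparameters are placeholders, and this is not the EXTRA system's actual code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["invoice", "contract", "report"]          # hypothetical document classes
texts = ["total amount due 540 EUR", "the parties agree to the following terms"]
targets = torch.tensor([0, 1])

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)      # adds a linear layer on top of BERT
)

batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                                   # a few illustrative fine-tuning steps
    outputs = model(**batch, labels=targets)         # cross-entropy loss computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    logits = model(**batch).logits
print([labels[i] for i in logits.argmax(dim=-1).tolist()])
```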
CityLABS Workshop: Working with large tables (Enrico Daga)
This document discusses working with large tables and big data processing. It introduces distributed computing as an approach to process large datasets by distributing data across multiple nodes and parallelizing operations. The document then outlines using Apache Hadoop and the MK Data Hub cluster to distribute data storage and processing. It demonstrates how to use tools like Hue, Hive, and Pig to analyze tabular data in a distributed manner at scale. Finally, hands-on examples are provided for computing TF-IDF statistics on the large Gutenberg text corpus.
Introduction to search engine-building with Lucene (Kai Chan)
These are the slides for the session I presented at SoCal Code Camp Los Angeles on October 14, 2012.
http://www.socalcodecamp.com/session.aspx?sid=a4774b3c-7a2d-45db-8721-f54c5a314e17
This document discusses different types of query languages used for information retrieval systems. It describes keyword queries where documents are retrieved based on the presence of query words. Phrase queries search for an exact sequence of words. Boolean queries use logical operators like AND, OR and NOT to combine search terms. Natural language queries allow users to enter searches in a free-form manner but require translation to a formal query language. The document provides examples and explanations of each query language type over its 12 sections.
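A small sketch of evaluating Boolean queries with set operations over an inverted index; the index contents are made up:

```python
# Toy inverted index: term -> set of document IDs.
index = {
    "information": {1, 2, 4},
    "retrieval": {1, 4},
    "database": {2, 3},
    "web": {3, 4},
}
all_docs = {1, 2, 3, 4}

def AND(a, b): return a & b
def OR(a, b): return a | b
def NOT(a): return all_docs - a

# "information AND retrieval AND NOT web"
result = AND(AND(index["information"], index["retrieval"]), NOT(index["web"]))
print(sorted(result))  # [1]

# "database OR web"
print(sorted(OR(index["database"], index["web"])))  # [2, 3, 4]
```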
The document discusses information retrieval (IR) models, including the Boolean, vector space, and probabilistic models. The Boolean model represents documents and queries as sets of index terms and determines relevance through binary term presence, while the vector space model represents documents and queries as weighted vectors in a multidimensional space and ranks documents by calculating similarity between document and query vectors. The probabilistic model determines relevance probabilities based on the likelihood of terms appearing in relevant vs. non-relevant documents.
Topic models such as Latent Dirichlet Allocation (LDA) have been extensively used for characterizing text collections according to the topics discussed in documents. Organizing documents according to topic can be applied to different information access tasks such as document clustering, content-based recommendation or summarization. Spoken documents such as podcasts typically involve more than one speaker (e.g., meetings, interviews, chat shows or news with reporters). This paper presents a work-in-progress based on a variation of LDA that includes in the model the different speakers participating in conversational audio transcripts. Intuitively, each speaker has her own background knowledge which generates different topic and word distributions. We believe that informing a topic model with speaker segmentation (e.g., using existing speaker diarization techniques) may enhance discovery of topics in multi-speaker audio content.
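For contrast with the speaker-aware variant the paper proposes, a plain-LDA baseline can be sketched with gensim (assumed library; the toy transcript segments and parameters are placeholders):

```python
from gensim import corpora
from gensim.models import LdaModel

# Tokenized "transcript" segments, e.g. sports vs. politics chatter.
segments = [
    ["match", "goal", "league", "season", "coach"],
    ["election", "vote", "parliament", "policy"],
    ["league", "player", "transfer", "goal"],
    ["policy", "minister", "vote", "reform"],
]

dictionary = corpora.Dictionary(segments)
corpus = [dictionary.doc2bow(seg) for seg in segments]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=50, random_state=0)

for topic_id in range(2):
    print(topic_id, lda.print_topic(topic_id, topn=4))
```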
This document introduces distributed computing and tools for processing large tabular data using the Big Data Cluster. It discusses how distributed computing allows tabular data to be replicated across nodes and computation to be parallelized. It then provides an overview of Hadoop and how the Big Data Cluster can be used with tools like Hue, Hive, and Pig to perform analytics on large datasets. Finally, it walks through an example of computing TF-IDF scores on a corpus of text documents from Project Gutenberg.
This document discusses file organizations and indexing in databases. It describes different file organization approaches like heap files, sorted files, and clustered files. It also discusses alternative representations for data entries in indexes and different types of indexes like clustered vs unclustered indexes and single key vs composite indexes. The document provides analysis of the costs for different database operations under various file organizations and indexes.
This session covers topics related to data archiving and sharing. This includes data formats, metadata, controlled vocabularies, preservation, archiving and repositories.
Introduction to search engine-building with Lucene (Kai Chan)
These are the slides for the session I presented at SoCal Code Camp San Diego on June 24, 2012.
http://www.socalcodecamp.com/session.aspx?sid=f9e83f56-3c56-4aa1-9cff-154c6537ccbe
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale (Evan Chan)
My keynote presentation about how we developed FiloDB, a distributed, Prometheus-compatible time series database, productionized it at Apple and scaled it out to handle a huge amount of operational data, based on the stack of Kafka, Cassandra, Scala/Akka.
Strategies for Effective Upskilling is a presentation by Chinwendu Peace at a Your Skill Boost Masterclass organised by the Excellence Foundation for South Sudan on 8th and 9th June 2024, from 1 PM to 3 PM each day.
Bangladesh Economic Review 2024 [Bangladesh Economic Review 2024 Bangla.pdf]: a complete Bangla e-book/PDF for computer, tablet, and smartphone, with a table of contents plus bookmark and hyperlink menus.
A very important book for all of us: a key subject for BCS, bank, and university admission exams and any competitive examination, and it also contains recent data and statistics on Bangladesh.
As a citizen, you need to know this information.
It is useful for the BCS and bank written examinations, and will also be of great use to secondary and higher-secondary students.
हिंदी वर्णमाला पीपीटी, hindi alphabet PPT presentation, hindi varnamala PPT, Hindi Varnamala pdf, हिंदी स्वर, हिंदी व्यंजन, sikhiye hindi varnmala, dr. mulla adam ali, hindi language and literature, hindi alphabet with drawing, hindi alphabet pdf, hindi varnamala for childrens, hindi language, hindi varnamala practice for kids, https://www.drmullaadamali.com
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptxEduSkills OECD
Iván Bornacelly, Policy Analyst at the OECD Centre for Skills, OECD, presents at the webinar 'Tackling job market gaps with a skills-first approach' on 12 June 2024
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPRAHUL
This Dissertation explores the particular circumstances of Mirzapur, a region located in the
core of India. Mirzapur, with its varied terrains and abundant biodiversity, offers an optimal
environment for investigating the changes in vegetation cover dynamics. Our study utilizes
advanced technologies such as GIS (Geographic Information Systems) and Remote sensing to
analyze the transformations that have taken place over the course of a decade.
The complex relationship between human activities and the environment has been the focus
of extensive research and worry. As the global community grapples with swift urbanization,
population expansion, and economic progress, the effects on natural ecosystems are becoming
more evident. A crucial element of this impact is the alteration of vegetation cover, which plays a
significant role in maintaining the ecological equilibrium of our planet.Land serves as the foundation for all human activities and provides the necessary materials for
these activities. As the most crucial natural resource, its utilization by humans results in different
'Land uses,' which are determined by both human activities and the physical characteristics of the
land.
The utilization of land is impacted by human needs and environmental factors. In countries
like India, rapid population growth and the emphasis on extensive resource exploitation can lead
to significant land degradation, adversely affecting the region's land cover.
Therefore, human intervention has significantly influenced land use patterns over many
centuries, evolving its structure over time and space. In the present era, these changes have
accelerated due to factors such as agriculture and urbanization. Information regarding land use and
cover is essential for various planning and management tasks related to the Earth's surface,
providing crucial environmental data for scientific, resource management, policy purposes, and
diverse human activities.
Accurate understanding of land use and cover is imperative for the development planning
of any area. Consequently, a wide range of professionals, including earth system scientists, land
and water managers, and urban planners, are interested in obtaining data on land use and cover
changes, conversion trends, and other related patterns. The spatial dimensions of land use and
cover support policymakers and scientists in making well-informed decisions, as alterations in
these patterns indicate shifts in economic and social conditions. Monitoring such changes with the
help of Advanced technologies like Remote Sensing and Geographic Information Systems is
crucial for coordinated efforts across different administrative levels. Advanced technologies like
Remote Sensing and Geographic Information Systems
9
Changes in vegetation cover refer to variations in the distribution, composition, and overall
structure of plant communities across different temporal and spatial scales. These changes can
occur natural.
This document provides an overview of wound healing, its functions, stages, mechanisms, factors affecting it, and complications.
A wound is a break in the integrity of the skin or tissues, which may be associated with disruption of the structure and function.
Healing is the body’s response to injury in an attempt to restore normal structure and functions.
Healing can occur in two ways: Regeneration and Repair
There are 4 phases of wound healing: hemostasis, inflammation, proliferation, and remodeling. This document also describes the mechanism of wound healing. Factors that affect healing include infection, uncontrolled diabetes, poor nutrition, age, anemia, the presence of foreign bodies, etc.
Complications of wound healing like infection, hyperpigmentation of scar, contractures, and keloid formation.
2. Web search engines
Rooted in Information Retrieval (IR) systems
•Prepare a keyword index for corpus
•Respond to keyword queries with a ranked list of
documents.
ARCHIE
•Earliest application of rudimentary IR systems to
the Internet
•Title search across sites serving files over FTP
3. 3
Boolean queries: Examples
Simple queries involving relationships
between terms and documents
• Documents containing the word Java
• Documents containing the word Java but not
the word coffee
Proximity queries
• Documents containing the phrase Java beans
or the term API
• Documents where Java and island occur in
the same sentence
4. 4
Document preprocessing
Tokenization
• Filtering away tags
• Tokens regarded as nonempty sequences of
characters, excluding spaces and punctuation.
• Token represented by a suitable integer, tid,
typically 32 bits
• Optional: stemming/conflation of words
• Result: document (did) transformed into a
sequence of integers (tid, pos)
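To make the (tid, pos) representation concrete, here is a minimal sketch in Python; the in-memory lexicon dict and the regex-based token definition are illustrative assumptions, not the slide's exact implementation (tag filtering and stemming are omitted).
```python
import re

lexicon = {}  # hypothetical in-memory map: token string -> integer tid

def tokenize(text):
    """Filter punctuation, lowercase, and emit the (tid, pos) sequence for one document."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # nonempty runs of word characters
    postings = []
    for pos, tok in enumerate(tokens):
        tid = lexicon.setdefault(tok, len(lexicon))   # assign a compact integer id
        postings.append((tid, pos))
    return postings

print(tokenize("Documents containing the word Java, but not coffee."))
```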
5. 5
Storing tokens
Straight-forward implementation using a
relational database
• Example figure
• Space scales to almost 10 times the original corpus size
Accesses to the table show a common pattern
• Reduce the storage by mapping tids to a
lexicographically sorted buffer of (did, pos)
tuples.
• Indexing = transposing the document-term matrix
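The "indexing = transposing the document-term matrix" idea can be sketched as follows; the toy docs dict and whitespace tokenization are assumptions for illustration.
```python
from collections import defaultdict

def build_inverted_index(docs):
    """Transpose the (did -> terms) view into a (term -> sorted (did, pos) postings) view."""
    index = defaultdict(list)
    for did, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((did, pos))
    return {term: sorted(postings) for term, postings in index.items()}

docs = {1: "java island coffee", 2: "java beans api"}
print(build_inverted_index(docs)["java"])   # [(1, 0), (2, 0)]
```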
6. 6
Two variants of the inverted index data structure, usually stored on disk. The simpler
version in the middle does not store term offset information; the version to the right stores
term
offsets. The mapping from terms to documents and positions (written as
“document/position”) may
be implemented using a B-tree or a hash-table.
7. 7
Storage
For dynamic corpora
• Berkeley DB2 storage manager
• Can frequently add, modify and delete
documents
For static collections
• Index compression techniques (to be
discussed)
8. 8
Stopwords
Function words and connectives
Appear in large number of documents and little
use in pinpointing documents
Indexing stopwords
• Stopwords not indexed
For reducing index space and improving performance
• Replace stopwords with a placeholder (to remember
the offset)
Issues
• Queries containing only stopwords ruled out
• Polysemous words that are stopwords in one sense
but not in others
E.g.: "can" as a verb vs. "can" as a noun
9. 9
Stemming
Conflating words to help match a query term with a
morphological variant in the corpus.
Remove inflections that convey parts of speech, tense
and number
E.g.: university and universal both stem to universe.
Techniques
• morphological analysis (e.g., Porter's algorithm)
• dictionary lookup (e.g., WordNet).
Stemming may increase recall but at the price of
precision
• Abbreviations, polysemy and names coined in the technical and
commercial sectors
• E.g.: Stemming “ides” to “IDE”, “SOCKS” to “sock”, “gated” to
“gate”, may be bad !
10. 10
Batch indexing and updates
Incremental indexing
• Time-consuming due to random disk IO
• High level of disk block fragmentation
Simple sort-merges.
• To replace the indexed update of variable-
length postings
For a dynamic collection
• single document-level change may need to
update hundreds to thousands of records.
• Solution : create an additional “stop-press”
index.
12. 12
Stop-press index
Collection of documents in flux
• Model document modification as deletion followed by insertion
• Documents in flux represented by a signed record (d,t,s)
• “s” specifies if “d” has been deleted or inserted.
Getting the final answer to a query
• Main index returns a document set D0.
• Stop-press index returns two document sets
D+ : documents not yet indexed in D0 matching the query
D- : documents matching the query removed from the collection
since D0 was constructed.
Stop-press index getting too large
• Rebuild the main index
signed (d, t, s) records are sorted in (t, d, s) order and merge-
purged into the master (t, d) records
• Stop-press index can be emptied out.
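A minimal sketch of assembling the final answer (D0 ∪ D+) \ D- from the stop-press records described above; the tuple layout of the signed records is an assumption for illustration.
```python
def answer_with_stop_press(d0, stop_press_matches):
    """Combine the stale main-index result D0 with signed stop-press records.

    stop_press_matches: iterable of (d, s) pairs for documents matching the query,
    where s is '+' if d was inserted and '-' if d was deleted since D0 was built.
    """
    d_plus = {d for d, s in stop_press_matches if s == '+'}
    d_minus = {d for d, s in stop_press_matches if s == '-'}
    return (d0 | d_plus) - d_minus

print(answer_with_stop_press({1, 2, 3}, [(4, '+'), (2, '-')]))   # {1, 3, 4}
```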
13. 13
Index compression techniques
Compressing the index so that much of it
can be held in memory
• Required for high-performance IR installations
(as with Web search engines),
Redundancy in index storage
• Storage of document IDs.
Delta encoding
• Sort Doc IDs in increasing order
• Store the first ID in full
• Subsequently store only difference (gap) from
previous ID
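A small sketch of the delta (gap) encoding just described:
```python
def gap_encode(doc_ids):
    """Sort doc IDs, store the first in full, then only the gap from the previous ID."""
    doc_ids = sorted(doc_ids)
    return [doc_ids[0]] + [cur - prev for prev, cur in zip(doc_ids, doc_ids[1:])]

def gap_decode(gaps):
    """Recover the original IDs with a running sum."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids

print(gap_encode([23, 25, 40, 41]))      # [23, 2, 15, 1]
print(gap_decode([23, 2, 15, 1]))        # [23, 25, 40, 41]
```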
14. 14
Encoding gaps
Small gap must cost far fewer bits than a
document ID.
Binary encoding
• Optimal when all symbols are equally likely
Unary code
• optimal if probability of large gaps decays
exponentially
15. 15
Encoding gaps
Gamma code
• Represent gap x as: the unary code for $1 + \lfloor \log_2 x \rfloor$, followed by $x - 2^{\lfloor \log_2 x \rfloor}$ represented in binary (in $\lfloor \log_2 x \rfloor$ bits)
Golomb codes
• Further enhancement
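A sketch of gamma coding of a gap, using one common bit convention (a run of ones terminated by a zero for the unary part); the slide's exact convention may differ.
```python
def gamma_encode(x):
    """Elias gamma code of a positive gap x, returned as a bit string."""
    n = x.bit_length() - 1                    # floor(log2 x)
    unary = "1" * n + "0"                     # unary code for 1 + floor(log2 x)
    offset = format(x - (1 << n), "0" + str(n) + "b") if n else ""
    return unary + offset                     # floor(log2 x) binary bits follow

print(gamma_encode(1), gamma_encode(5), gamma_encode(9))   # 0 11001 1110001
```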
16. 16
Lossy compression mechanisms
Trading off space for time
• Collect documents into buckets
• Construct inverted index from terms to bucket IDs
• Document IDs shrink to half their size.
Cost: time overheads
• For each query, all documents in that bucket
need to be scanned
Solution: index documents in each bucket
separately
• E.g.: Glimpse
17. 17
General dilemmas
Messy updates vs. High compression rate
Storage allocation vs. Random I/Os
Random I/O vs. large scale
implementation
18. 18
Relevance ranking
Keyword queries
• In natural language
• Not precise, unlike SQL
Boolean decision for response unacceptable
• Solution
Rate each document for how likely it is to satisfy the user's
information need
Sort in decreasing order of the score
Present results in a ranked list.
No algorithmic way of ensuring that the ranking
strategy always favors the information need
• Query: only a part of the user's information need
19. 19
Responding to queries
Set-valued response
• Response set may be very large
(E.g., by recent estimates, over 12 million Web
pages contain the word java.)
Demanding selective query from user
Guessing user's information need and
ranking responses
Evaluating rankings
20. 20
Evaluating procedure
Given benchmark
• Corpus of n documents D
• A set of queries Q
• For each query $q$, an exhaustive set of relevant documents $D_q \subseteq D$ identified manually
Query $q \in Q$ submitted to the system
• System responds with a ranked list of documents $(d_1, d_2, \ldots, d_n)$
• Compute a 0/1 relevance list $(r_1, r_2, \ldots, r_n)$, with $r_i = 1$ iff $d_i \in D_q$ and $r_i = 0$ otherwise
21. 21
Recall and precision
Recall at rank $k \ge 1$
• Fraction of all relevant documents included in $(d_1, d_2, \ldots, d_k)$
• $\mathrm{recall}(k) = \frac{1}{|D_q|} \sum_{i=1}^{k} r_i$
Precision at rank $k$
• Fraction of the top $k$ responses that are actually relevant
• $\mathrm{precision}(k) = \frac{1}{k} \sum_{i=1}^{k} r_i$
22. 22
Other measures
Average precision
• Sum of precision at each relevant hit position in the
response list, divided by the total number of relevant
documents
• $\mathrm{avg.precision} = \frac{1}{|D_q|} \sum_{k} r_k \cdot \mathrm{precision}(k)$
• avg.precision = 1 iff the engine retrieves all relevant documents and ranks them ahead of any irrelevant document
Interpolated precision
• To combine precision values from multiple queries
• Gives the precision-vs.-recall curve for the benchmark:
for each query, take the maximum precision obtained at any recall greater than or equal to the given recall level, then average these values over all queries
Others, like measures of authority, prestige, etc.
23. 23
Precision-Recall tradeoff
Interpolated precision cannot increase with
recall
• Interpolated precision at recall level 0 may be less
than 1
At level k = 0
• Precision (by convention) = 1, Recall = 0
Inspecting more documents
• Can increase recall
• Precision may decrease
we will start encountering more and more irrelevant
documents
Search engine with a good ranking function will
generally show a negative relation between
recall and precision.
24. 24
Figure: precision and interpolated precision plotted against recall for the given relevance vector $r_k$; missing entries are zeroes.
25. 25
The vector space model
Documents represented as vectors in a
multi-dimensional Euclidean space
• Each axis = a term (token)
Coordinate of document d in direction of
term t determined by:
• Term frequency TF(d,t)
number of times term t occurs in document d,
scaled in a variety of ways to normalize document
length
• Inverse document frequency IDF(t)
to scale down the coordinates of terms that occur
in many documents
26. 26
Term frequency
• $\mathrm{TF}(d,t) = \dfrac{n(d,t)}{\sum_{\tau} n(d,\tau)}$, or $\mathrm{TF}(d,t) = \dfrac{n(d,t)}{\max_{\tau} n(d,\tau)}$
Cornell SMART system uses a smoothed version
• $\mathrm{TF}(d,t) = 0$ if $n(d,t) = 0$
• $\mathrm{TF}(d,t) = 1 + \log\bigl(1 + \log n(d,t)\bigr)$ otherwise
27. 27
Inverse document frequency
Given
• $D$ is the document collection and $D_t$ is the set of documents containing $t$
Formulae
• Mostly dampened functions of $\dfrac{|D|}{|D_t|}$
• SMART: $\mathrm{IDF}(t) = \log\!\left(1 + \dfrac{|D|}{|D_t|}\right)$
28. 28
Vector space model
Coordinate of document d in axis t
• $d_t = \mathrm{TF}(d,t) \cdot \mathrm{IDF}(t)$
• Document $d$ transformed to the vector $\vec{d}$ in the TFIDF-space
Query q
• Interpreted as a document
• Transformed to the vector $\vec{q}$ in the same TFIDF-space as $\vec{d}$
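A minimal sketch of the TFIDF coordinate $d_t = \mathrm{TF}(d,t) \cdot \mathrm{IDF}(t)$, using the SMART-style smoothing and IDF from the previous slides; the toy document-frequency table is an assumption.
```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, num_docs):
    """Map a token list to a sparse vector {term: TF(d,t) * IDF(t)}."""
    vec = {}
    for t, n in Counter(tokens).items():
        tf = 1.0 + math.log(1.0 + math.log(n))               # SMART-smoothed TF (n >= 1)
        idf = math.log(1.0 + num_docs / doc_freq.get(t, 1))  # IDF(t) = log(1 + |D|/|D_t|)
        vec[t] = tf * idf
    return vec

doc_freq = {"java": 120, "island": 40, "coffee": 75}          # toy |D_t| counts
print(tfidf_vector("java island java".split(), doc_freq, num_docs=1000))
```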
29. 29
Measures of proximity
Distance measure
• Magnitude of the vector difference $|\vec{d} - \vec{q}|$
• Document vectors must be normalized to unit ($L_1$ or $L_2$) length
Else shorter documents dominate (since queries are short)
Cosine similarity
• Cosine of the angle between $\vec{d}$ and $\vec{q}$
Shorter documents are penalized
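Cosine similarity between two sparse TFIDF vectors, as a short sketch:
```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

print(cosine_similarity({"java": 1.5, "island": 2.1}, {"java": 1.0}))
```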
30. 30
Relevance feedback
Users learning how to modify queries
• Response list must have least some relevant
documents
• Relevance feedback
`correcting' the ranks to the user's taste
automates the query refinement process
Rocchio's method
• Folding-in user feedback
• To query vector
Add a weighted sum of vectors for relevant documents D+
Subtract a weighted sum of the irrelevant documents D-
• $\vec{q}\,' = \alpha \vec{q} + \beta \sum_{d \in D^+} \vec{d} - \gamma \sum_{d \in D^-} \vec{d}$
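A sketch of the Rocchio update on sparse vectors; the α, β, γ values below are conventional defaults for illustration, not values given on the slide.
```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + beta*sum(d in D+) - gamma*sum(d in D-), on {term: weight} dicts."""
    new_q = {t: alpha * w for t, w in query.items()}
    for d in relevant:
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) + beta * w
    for d in irrelevant:
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) - gamma * w
    return {t: w for t, w in new_q.items() if w > 0.0}   # negative weights usually dropped

print(rocchio({"java": 1.0},
              relevant=[{"java": 2.0, "island": 1.0}],
              irrelevant=[{"coffee": 3.0}]))
```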
31. 31
Relevance feedback (contd.)
Pseudo-relevance feedback
• D+ and D- generated automatically
E.g.: Cornell SMART system
top 10 documents reported by the first round of
query execution are included in D+
• γ typically set to 0; D- not used
Not a commonly available feature
• Web users want instant gratification
• System complexity
Executing the second round query slower and
expensive for major search engines
32. 32
Ranking by odds ratio
R : Boolean random variable which
represents the relevance of document d
w.r.t. query q.
Ranking documents by their odds ratio for relevance
• $\dfrac{\Pr(R \mid q, d)}{\Pr(\bar{R} \mid q, d)} = \dfrac{\Pr(d \mid R, q)\,\Pr(R \mid q)}{\Pr(d \mid \bar{R}, q)\,\Pr(\bar{R} \mid q)}$
Approximating the probability of $d$ by the product of the probabilities of the individual terms in $d$
• $\dfrac{\Pr(d \mid R, q)}{\Pr(d \mid \bar{R}, q)} \approx \prod_{t} \dfrac{\Pr(x_t \mid R, q)}{\Pr(x_t \mid \bar{R}, q)}$
• Approximately, writing $a_t = \Pr(x_t = 1 \mid R, q)$ and $b_t = \Pr(x_t = 1 \mid \bar{R}, q)$:
$\dfrac{\Pr(R \mid q, d)}{\Pr(\bar{R} \mid q, d)} \propto \prod_{t \in q \cap d} \dfrac{a_t (1 - b_t)}{b_t (1 - a_t)}$
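Under the term-independence approximation above, ranking by the odds ratio reduces to summing per-term log-odds over terms shared by the document and query; the probabilities in the example are made up.
```python
import math

def odds_ratio_score(doc_terms, query_terms, a, b):
    """Sum of log( a_t(1-b_t) / (b_t(1-a_t)) ) over query terms present in the document."""
    score = 0.0
    for t in query_terms & doc_terms:
        score += math.log((a[t] * (1.0 - b[t])) / (b[t] * (1.0 - a[t])))
    return score

a = {"java": 0.6, "island": 0.4}     # Pr(term present | relevant)
b = {"java": 0.1, "island": 0.05}    # Pr(term present | not relevant)
print(odds_ratio_score({"java", "coffee"}, {"java", "island"}, a, b))
```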
33. 33
Bayesian Inferencing
Bayesian inference network for relevance ranking. A
document is relevant to the extent that setting its
corresponding belief node to true lets us assign a high
degree of belief in the node corresponding to the query.
Manual specification of mappings from terms to approximate concepts.
34. 34
Bayesian Inferencing (contd.)
Four layers
1.Document layer
2.Representation layer
3.Query concept layer
4.Query
Each node is associated with a random
Boolean variable, reflecting belief
Directed arcs signify that the belief of a
node is a function of the belief of its
immediate parents (and so on..)
35. 35
Bayesian Inferencing systems
Layers 2 & 3 are the same as for basic vector-space IR systems
Verity's Search97
• Allows administrators and users to define
hierarchies of concepts in files
Estimation of relevance of a document d
w.r.t. the query q
• Set the belief of the corresponding node to 1
• Set all other document beliefs to 0
• Compute the belief of the query
• Rank documents in decreasing order of belief
that they induce in the query
36. 36
Other issues
Spamming
• Adding popular query terms to a page unrelated to
those terms
• E.g.: Adding “Hawaii vacation rental” to a page about
“Internet gambling”
• Little setback due to hyperlink-based ranking
Titles, headings, meta tags and anchor-text
• TFIDF framework treats all terms the same
• Meta search engines:
Assign weightage to text occurring in tags, meta-tags
• Using anchor-text on pages u which link to v
Anchor-text on u offers valuable editorial judgment about v as
well.
37. 37
Other issues (contd..)
Including phrases to rank complex queries
• Operators to specify word inclusions and
exclusions
• With operators and phrases
queries/documents can no longer be treated
as ordinary points in vector space
Dictionary of phrases
• Could be cataloged manually
• Could be derived from the corpus itself using
statistical techniques
• Two separate indices:
one for single terms and another for phrases
38. 38
Corpus derived phrase dictionary
Two terms $t_1$ and $t_2$
Null hypothesis = occurrences of $t_1$ and $t_2$ are independent
To the extent the pair violates the null hypothesis, it is likely to be a phrase
• Measuring violation with the likelihood ratio of the hypothesis
• Pick phrases that violate the null hypothesis with large confidence
Contingency table built from the statistics $k_{00}(t_1,t_2)$, $k_{01}(t_1,t_2)$, $k_{10}(t_1,t_2)$ and $k_{11}(t_1,t_2)$, counting how often $t_1$ and $t_2$ occur together and apart
39. 39
Corpus derived phrase dictionary
Hypotheses
• Null hypothesis: the cell probabilities factor into marginals $p_1, p_2$, i.e. the four cells have probabilities $(1-p_1)(1-p_2)$, $(1-p_1)p_2$, $p_1(1-p_2)$ and $p_1 p_2$
• Alternative hypothesis: the four cell probabilities $p_{00}, p_{01}, p_{10}, p_{11}$ are unconstrained
• Likelihood ratio
$\lambda = \dfrac{\max_{p \in H_0} H(p; k)}{\max_{p \in H_1} H(p; k)}$
where
$H(p_1, p_2; k_{00}, k_{01}, k_{10}, k_{11}) = \bigl((1-p_1)(1-p_2)\bigr)^{k_{00}} \bigl((1-p_1)p_2\bigr)^{k_{01}} \bigl(p_1(1-p_2)\bigr)^{k_{10}} (p_1 p_2)^{k_{11}}$
$H(p_{00}, p_{01}, p_{10}, p_{11}; k_{00}, k_{01}, k_{10}, k_{11}) = p_{00}^{k_{00}}\, p_{01}^{k_{01}}\, p_{10}^{k_{10}}\, p_{11}^{k_{11}}$
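A sketch of the likelihood-ratio computation from the 2x2 contingency counts, maximizing the factored (independence) model against the unconstrained cell model as above; the counts in the example call are invented.
```python
import math

def log_lik(cell_probs, counts):
    """Sum k * log(p) over the four cells, treating k = 0 cells as contributing 0."""
    return sum(k * math.log(p) for p, k in zip(cell_probs, counts) if k > 0)

def phrase_llr(k00, k01, k10, k11):
    """-2 log(lambda); large values mean (t1, t2) strongly violates independence."""
    counts = (k00, k01, k10, k11)
    n = sum(counts)
    p1 = (k10 + k11) / n                      # MLE of Pr(t1) under the null model
    p2 = (k01 + k11) / n                      # MLE of Pr(t2) under the null model
    null = log_lik(((1 - p1) * (1 - p2), (1 - p1) * p2, p1 * (1 - p2), p1 * p2), counts)
    alt = log_lik(tuple(k / n for k in counts), counts)
    return -2.0 * (null - alt)

print(round(phrase_llr(k00=9000, k01=80, k10=120, k11=50), 2))
```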
40. 40
Approximate string matching
Non-uniformity of word spellings
• dialects of English
• transliteration from other languages
Two ways to reduce this problem.
1. Aggressive conflation mechanism to
collapse variant spellings into the same
token
2. Decompose terms into a sequence of q-
grams or sequences of q characters
41. 41
Approximate string matching
1. Aggressive conflation mechanism to collapse
variant spellings into the same token
• E.g.: Soundex : takes phonetics and pronunciation details
into account
• used with great success in indexing and searching last
names in census and telephone directory data.
2. Decompose terms into a sequence of q-grams, i.e. sequences of q characters (q typically between 2 and 4)
• Check for similarity in the grams
• Looking up the inverted index : a two-stage affair:
• Smaller index of q-grams consulted to expand each query
term into a set of slightly distorted query terms
• These terms are submitted to the regular index
• Used by Google for spelling correction
• Idea also adopted for eliminating near-duplicate pages
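A small sketch of q-gram decomposition and a gram-overlap similarity that could drive the spelling-correction lookup described above; the '$' padding and q = 3 default are assumptions.
```python
def qgrams(term, q=3):
    """Set of q-grams of a term, padded with '$' so the ends are represented too."""
    padded = "$" * (q - 1) + term.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_similarity(a, b, q=3):
    """Fraction of shared q-grams between two spellings."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)

print(round(qgram_similarity("receive", "recieve"), 2))   # close spellings share most grams
```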
42. 42
Meta-search systems
• Take the search engine to the document
• Forward queries to many geographically distributed
repositories
• Each has its own search service
• Consolidate their responses.
• Advantages
• Perform non-trivial query rewriting
• Suit a single user query to many search engines with
different query syntax
• Surprisingly small overlap between crawls
• Consolidating responses
• Function goes beyond just eliminating duplicates
• Search services do not provide standard ranks which
can be combined meaningfully
43. 43
Similarity search
• Cluster hypothesis
• Documents similar to relevant documents are
also likely to be relevant
• Handling “find similar” queries
• Replication or duplication of pages
• Mirroring of sites
44. Mining the Web Chakrabarti and Ramakrishnan 44
Document similarity
• Jaccard coefficient of similarity between documents $d_1$ and $d_2$
• $T(d)$ = set of tokens in document $d$
• $r'(d_1, d_2) = \dfrac{|T(d_1) \cap T(d_2)|}{|T(d_1) \cup T(d_2)|}$
• Symmetric, reflexive, but not a metric
• Forgives any number of occurrences and any permutation of the terms
• $1 - r'(d_1, d_2)$ is a metric
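The Jaccard coefficient above, as a one-function sketch:
```python
def jaccard(d1_tokens, d2_tokens):
    """r'(d1, d2) = |T(d1) & T(d2)| / |T(d1) | T(d2)| over token sets."""
    t1, t2 = set(d1_tokens), set(d2_tokens)
    return len(t1 & t2) / len(t1 | t2) if (t1 or t2) else 0.0

print(jaccard("java island coffee".split(), "java coffee beans".split()))   # 0.5
```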
45. 45
Estimating Jaccard coefficient with
random permutations
1. Generate a set of $m$ random permutations $\pi$
2. for each permutation $\pi$ do
3. compute $\pi(T(d_1))$ and $\pi(T(d_2))$
4. check whether $\min \pi(T(d_1)) = \min \pi(T(d_2))$
5. end for
6. if equality was observed in $k$ cases, estimate $r'(d_1, d_2) = k/m$
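A direct (unoptimized) sketch of the estimator above, using explicit random permutations of an assumed integer token-id space; real systems typically use hash functions instead of materialized permutations.
```python
import random

def estimate_jaccard(t1, t2, m=200, universe=1000, seed=42):
    """Estimate r'(d1, d2) as k/m, where k counts permutations with matching minima."""
    rng = random.Random(seed)
    ids = list(range(universe))
    k = 0
    for _ in range(m):
        rng.shuffle(ids)                                  # one random permutation pi
        if min(ids[t] for t in t1) == min(ids[t] for t in t2):
            k += 1
    return k / m

d1, d2 = {1, 2, 3, 4}, {2, 3, 4, 5}                       # true Jaccard = 3/5
print(estimate_jaccard(d1, d2))
```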
46. 46
Fast similarity search with random
permutations
1. for each random permutation $\pi$ do
2. create a file $f_\pi$
3. for each document $d$ do
4. write out $\langle s, d \rangle$ with $s = \min(\pi(T(d)))$ to $f_\pi$
5. end for
6. sort $f_\pi$ using key $s$; this results in contiguous blocks with fixed $s$ containing all associated $d$
7. create a file $g_\pi$
8. for each pair $(d_1, d_2)$ within a run of $f_\pi$ having a given $s$ do
9. write out a document-pair record $\langle d_1, d_2 \rangle$ to $g_\pi$
10. end for
11. sort $g_\pi$ on key $(d_1, d_2)$
12. end for
13. merge $g_\pi$ for all $\pi$ in $(d_1, d_2)$ order, counting the number of $\langle d_1, d_2 \rangle$ entries
47. 47
Eliminating near-duplicates via shingling
• “Find-similar” algorithm reports all duplicate/near-
duplicate pages
• Eliminating duplicates
• Maintain a checksum with every page in the corpus
• Eliminating near-duplicates
• Represent each document as a set T(d) of q-grams (shingles)
• Find the Jaccard similarity $r'(d_1, d_2)$ between $d_1$ and $d_2$
• Eliminate the pair from step 9 if it has similarity above a threshold
48. 48
Detecting locally similar sub-graphs of the
Web
• Similarity search and duplicate elimination on the
graph structure of the web
• To improve quality of hyperlink-assisted ranking
• Detecting mirrored sites
• Approach 1 [Bottom-up Approach]
1. Start process with textual duplicate detection
• cleaned URLs are listed and sorted to find duplicates/near-
duplicates
• each set of equivalent URLs is assigned a unique token ID
• each page is stripped of all text, and represented as a sequence
of outlink IDs
2. Continue using link sequence representation
3. Until no further collapse of multiple URLs are possible
• Approach 2 [Bottom-up Approach]
1. identify single nodes which are near duplicates (using text-
shingling)
2. extend single-node mirrors to two-node mirrors
3. continue on to larger and larger graphs which are likely mirrors of
one another
49. 49
Detecting mirrored sites (contd.)
• Approach 3 [Step before fetching all pages]
• Uses regularity in URL strings to identify host-pairs which are
mirrors
• Preprocessing
• Hosts are represented as sets of positional bigrams
• Convert host and path to all lowercase characters
• Let any punctuation or digit sequence be a token separator
• Tokenize the URL into a sequence of tokens, (e.g.,
www6.infoseek.com gives www, infoseek, com)
• Eliminate stop terms such as htm, html, txt, main, index, home,
bin, cgi
• Form positional bigrams from the token sequence
• Two hosts are said to be mirrors if
• A large fraction of paths are valid on both web sites
• These common paths link to pages that are near-duplicates.