Zipf's law describes the frequency distribution of words in natural language corpora. It states that the frequency of any word is inversely proportional to its rank in the frequency table: a few words are used very frequently, while most words have low frequency. Heaps' law estimates how vocabulary size grows with corpus size, at a sub-linear rate. Text preprocessing techniques like stopword removal and stemming aim to reduce noise by excluding non-discriminative words from indexes.
This document discusses incorporating probabilistic retrieval knowledge into TFIDF-based search engines. It provides an overview of different retrieval models such as Boolean, vector space, probabilistic, and language models. It then describes using a probabilistic model that estimates the probability of a document being relevant or non-relevant given its terms. This model can be combined with the BM25 ranking algorithm. The document proposes applying probabilistic knowledge to different document fields during ranking to improve relevance.
Yang Yu is proposing research on improving machine learning based ontology mapping by automatically obtaining training samples from the web. The proposed system would parse two input ontologies to generate queries to search engines and collect documents to use as samples for each ontology class. These samples would then be used to train text classifiers, which would produce probabilistic mappings between classes in the two ontologies. The results would be evaluated by comparing to mappings from human experts. Current work involves exploring alternative text classification tools and ways to utilize the probabilistic mapping values generated by the classifiers.
This document discusses building an inverted index to efficiently support information retrieval on large document collections. It describes tokenizing documents, building a dictionary of normalized terms, and creating postings lists that map each term to the documents it appears in. Inverted indexes allow skipping linear scanning and support flexible queries by indexing term locations. The document also covers calculating precision and recall to measure system effectiveness.
Boolean, vector space retrieval models - Primya Tamil
The document discusses various information retrieval models including Boolean, vector space, and probabilistic models. It provides details on how documents and queries are represented and compared in the vector space model. Specifically, it explains that in this model, documents and queries are represented as vectors of term weights in a multi-dimensional space. The similarity between a document and query vector is calculated using measures like the inner product or cosine similarity to retrieve and rank documents.
The document provides an overview of graph databases and RDF databases, describing how graph databases store data as nodes and relationships while RDF databases use a triple store model of subject-predicate-object statements to represent data. Examples are given demonstrating how to model and query data using both graph and RDF databases. The document also discusses scientific articles related to querying RDF data from a graph database perspective.
Word embeddings are a technique for converting words into vectors of numbers so that they can be processed by machine learning algorithms. Words with similar meanings are mapped to similar vectors in the vector space. There are two main types of word embedding models: count-based models that use co-occurrence statistics, and prediction-based models like CBOW and skip-gram neural networks that learn embeddings by predicting nearby words. Word embeddings allow words with similar contexts to have similar vector representations, and have applications such as document representation.
The CIDOC Conceptual Reference Model (CRM) provides definitions and a formal structure for describing the implicit and explicit concepts and relationships used in cultural heritage documentation.
The CIDOC CRM is intended to promote a shared understanding of cultural heritage information by providing a common and extensible semantic framework that any cultural heritage information can be mapped to. It is intended to be a common language for domain experts and implementers to formulate requirements for information systems and to serve as a guide for good practice of conceptual modelling. In this way, it can provide the "semantic glue" needed to mediate between different sources of cultural heritage information, such as that published by museums, libraries and archives.
The CIDOC CRM is the culmination of over 10 years' work by the CIDOC Documentation Standards Working Group and the CIDOC CRM SIG, which are working groups of CIDOC. Since 9/12/2006 it has been the official standard ISO 21127:2006.
Also available as presentation slides (with audio narration), slide-notes and video:
https://cidoc-crm.org/cidoc-crm-tutorial
Video Implementation
Centre for Cultural Informatics, Information Systems Laboratory
Foundation for Research and Technology
https://www.ics.forth.gr/isl/cci
Authoring and Presentation
Stephen Stead, Archaeologist, founder member of CIDOC CRM
Copyright CIDOC CRM, 2008
http://www.cidoc-crm.org/
The document discusses XML signatures, which provide integrity and non-repudiation for XML documents. It describes the components of an XML signature such as SignedInfo, SignatureValue and KeyInfo. It explains different types of XML signatures like enveloping, enveloped and detached signatures. It provides examples of XML signatures and answers questions about canonicalization, transforms, references and other concepts related to XML signatures.
The document discusses signature files, which are used for document retrieval. A signature file creates a compressed representation or "signature" for each document in a database. These signatures are stored in hash tables to allow easy retrieval of matching documents for user queries. Signatures can represent words using triplets of characters and a hash function, or entire documents through concatenation of word signatures or superimposed coding. Signature files provide a quick link between queries and documents but have lower accuracy than inverted files, which are generally better for information retrieval applications.
The document discusses the vector space model for representing text documents and queries in information retrieval systems. It describes how documents and queries are represented as vectors of term weights, with each term being assigned a weight based on its frequency in the document or query. The vector space model allows documents and queries to be compared by calculating the similarity between their vector representations. Terms that are more frequent in a document and less frequent overall are given higher weights through techniques like TF-IDF weighting. This vector representation enables efficient retrieval of documents ranked by similarity to the query.
An inverted file indexes a text collection to speed up searching. It contains a vocabulary of distinct words and occurrences lists with information on where each word appears. For each term in the vocabulary, it stores a list of pointers to occurrences called an inverted list. Coarser granularity indexes use less storage but require more processing, while word-level indexes enable proximity searches but use more space. The document describes how inverted files are structured and constructed from text and discusses techniques like block addressing that reduce their space requirements.
This document provides an overview of the Web Ontology Language (OWL). It discusses the requirements for ontology languages, the three species of OWL (Lite, DL, Full), the syntactic forms of OWL, and key elements of OWL including classes, properties, restrictions, and boolean combinations. It also covers special properties, datatypes, and versioning information. OWL builds on RDF and RDF Schema to provide a stronger language for defining ontologies with greater machine interpretability on the semantic web.
The Big Picture: Monitoring and Orchestration of Your Microservices Landscape... - confluent
(Bernd Ruecker, Camunda) Kafka Summit SF 2018
A company’s business processes typically span more than one microservice. In an e-commerce company, for example, a customer order might involve microservices for payments, inventory, shipping and more. Implementing long-running, asynchronous and complex collaboration of distributed microservices is challenging. How can we ensure visibility of cross-microservice flows and provide status and error monitoring? How do we guarantee that overall flows always complete, even if single services fail? Or, how do we at least recognize stuck flows so that we can fix them?
In this talk, I’ll demonstrate an approach based on real-life projects using the open source workflow engine zeebe.io to orchestrate microservices. Zeebe can connect to Kafka to coordinate workflows that span many microservices, providing end-to-end process visibility without violating the principles of loose coupling and service independence. Once an orchestration flow starts, Zeebe ensures that it is eventually carried out, retrying steps upon failure. In a Kafka architecture, Zeebe can easily produce events (or commands) and subscribe to events that will be correlated to workflows. Along the way, Zeebe facilitates monitoring and visibility into the progress and status of orchestration flows. Internally, Zeebe works as a distributed, event-driven and event-sourced system, making it not only very fast but horizontally scalable and fault tolerant—and able to handle the throughput required to operate alongside Kafka in a microservices architecture. Expect not only slides but also fun little live-hacking sessions and real-life stories.
This document discusses project management techniques for network analysis, specifically Critical Path Method (CPM) and Program Evaluation and Review Technique (PERT). It defines key concepts like activities, events, predecessors, successors. It provides an example network diagram and shows how to calculate earliest start/finish times, latest start/finish times, and float using CPM. It also discusses how PERT differs from CPM in using probabilistic time estimates and not distinguishing between critical and non-critical activities.
Neural Text Embeddings for Information Retrieval (WSDM 2017) - Bhaskar Mitra
The document describes a tutorial on using neural networks for information retrieval. It discusses an agenda for the tutorial that includes fundamentals of IR, word embeddings, using word embeddings for IR, deep neural networks, and applications of neural networks to IR problems. It provides context on the increasing use of neural methods in IR applications and research.
I gave this talk at the Highload++ conference 2015 in Moscow. The slides have been translated into English. They cover the Apache HAWQ components, its architecture, query processing logic, and competitive information.
Information Retrieval 4 (inverted index & query handling) - Jeet Das
The document describes the process of creating an inverted index to support keyword searching of documents. It discusses storing term postings lists that map terms to the documents that contain them. It also describes techniques like skip pointers, phrase queries, and proximity searches to improve query processing efficiency and support more complex search needs. Precision, recall, and f-score metrics for evaluating information retrieval systems are also summarized.
This document provides an introduction to big data, including its key characteristics of volume, velocity, and variety. It describes different types of big data technologies like Hadoop, MapReduce, HDFS, Hive, and Pig. Hadoop is an open source software framework for distributed storage and processing of large datasets across clusters of computers. MapReduce is a programming model used for processing large datasets in a distributed computing environment. HDFS provides a distributed file system for storing large datasets across clusters. Hive and Pig provide data querying and analysis capabilities for data stored in Hadoop clusters using SQL-like and scripting languages respectively.
This document defines key concepts in service-oriented architectures (SOA) including services, components, standards like SOAP, WSDL, and UDDI. It describes how SOA uses loosely coupled services that communicate through standardized web protocols. Services are defined through WSDL interfaces and discovered through UDDI directories. SOAP is the messaging standard used to enable communication between services. Orchestration and choreography standards like WS-BPEL and WS-CDL are used to compose services to create new composite applications and define allowable message exchanges.
This document provides an introduction to text mining, including definitions of text mining and how it differs from data mining. It describes common areas and applications of text mining such as information retrieval, natural language processing, and information extraction. The document outlines the typical process of text mining including preprocessing, feature generation and selection, and different mining techniques. It also discusses common approaches to text mining such as keyword-based analysis and document classification/clustering. Finally, it notes some challenges of text mining related to unstructured text data.
SPARQL 1.1 introduced several new features including:
- Updated versions of the SPARQL Query and Protocol specifications
- A SPARQL Update language for modifying RDF graphs
- A protocol for managing RDF graphs over HTTP
- Service descriptions for describing SPARQL endpoints
- Basic federated query capabilities
- Other minor features and extensions
A graph is a structure composed of a set of vertices (i.e. nodes, dots) connected to one another by a set of edges (i.e. links, lines). The concept of a graph has been around since the late 19th century; however, only in recent decades has there been a strong resurgence in the development of both graph theories and applications. In applied computing, since the late 1960s, the interlinked table structure of the relational database has been the predominant information storage and retrieval paradigm. With the growth of graph/network-based data and the need to efficiently process such data, new data management systems have been developed. In contrast to the index-intensive, set-theoretic operations of relational databases, graph databases make use of index-free traversals. This presentation will discuss the graph traversal programming pattern and its application to problem-solving with graph databases.
Vectorland: Brief Notes from Using Text Embeddings for Search - Bhaskar Mitra
(Invited talk at Search Solutions 2015)
A lot of recent work in neural models and “Deep Learning” is focused on learning vector representations for text, image, speech, entities, and other nuggets of information. From word analogies to automatically generating human level descriptions of images, the use of text embeddings has become a key ingredient in many natural language processing (NLP) and information retrieval (IR) tasks.
In this talk, I will present some personal learnings from working on (neural and non-neural) text embeddings for IR, as well as highlight a few key recent insights from the broader academic community. I will talk about the affinity of certain embeddings for certain kinds of tasks, and how the notion of relatedness in an embedding space depends on how the vector representations are trained. The goal of this talk is to encourage everyone to start thinking about text embeddings beyond just as an output of a “black box” machine learning model, and to highlight that the relationships between different embedding spaces are about as interesting as the relationships between items within an embedding space.
The document summarizes two papers about MapReduce frameworks for cloud computing. The first paper describes Hadoop, which uses MapReduce and HDFS to process large amounts of distributed data across clusters. HDFS stores data across cluster nodes in a fault-tolerant manner, while MapReduce splits jobs into parallel map and reduce tasks. The second paper discusses P2P-MapReduce, which allows for a dynamic cloud environment where nodes can join and leave. It uses a peer-to-peer model where nodes can be masters or slaves, and maintains backup masters to prevent job loss if the primary master fails.
The document discusses different theories used in information retrieval systems. It describes cognitive or user-centered theories that model human information behavior and structural or system-centered theories like the vector space model. The vector space model represents documents and queries as vectors of term weights and compares similarities between queries and documents. It was first used in the SMART information retrieval system and involves assigning term vectors and weights to documents based on relevance.
This document discusses various text operations and techniques for automatic indexing in information retrieval systems. It covers topics like tokenization, stop word removal, stemming, term weighting, Zipf's law, Luhn's model of word frequency, and Heap's law on vocabulary growth. The goal of these text operations is to select meaningful index terms from documents to represent their contents and reduce noise for more effective retrieval.
Lecture 7 - Text Statistics and Document Parsing - Sean Golliher
This document discusses various techniques for text processing and indexing documents for information retrieval systems. It covers topics like tokenization, stemming, stopwords, n-grams to identify phrases, and weighting important document elements like headers, anchor text, and metadata. The document also discusses using links between documents for link analysis and utilizing anchor text for retrieval.
Natural language processing (NLP) involves analyzing and understanding human language to allow interaction between computers and humans. The document outlines key steps in NLP including morphological analysis, syntactic analysis, semantic analysis, and pragmatic analysis to convert text into structured representations. It also discusses statistical NLP and real-world applications such as machine translation, question answering, and speech recognition.
Natural language processing (NLP) is introduced, including its definition, common steps like morphological analysis and syntactic analysis, and applications like information extraction and machine translation. Statistical NLP aims to perform statistical inference for NLP tasks. Real-world applications of NLP are discussed, such as automatic summarization, information retrieval, question answering and speech recognition. A demo of a free NLP application is presented at the end.
This document provides a survey of machine transliteration approaches. It begins with an introduction to machine transliteration, defining it as the process of transforming the script of a word from one language to another while preserving pronunciation. The survey then reviews the key methodologies introduced in the transliteration literature, categorizing approaches based on the resources and algorithms used and comparing their effectiveness.
This document discusses corpus linguistics and quantitative research design. It defines a corpus as a large collection of texts used for linguistic analysis. Corpus linguistics allows researchers to empirically test hypotheses about language patterns and features based on large amounts of real-world data. Quantitative analysis of corpus data shows how frequently certain words, constructions, and patterns are used. Specialized corpora can focus on particular text types, languages, or learner language. Various software tools are used to analyze corpora through frequency lists, keyword lists, collocation analysis, and other methods.
What are the basics of analysing a corpus? (chpt. 10, Routledge) - RajpootBhatti5
This document provides an overview of the basics of analyzing a corpus through various techniques including frequency analysis, normalization, keyword analysis, and concordance analysis. It explains that frequency lists show how often words occur, normalization adjusts for corpus size differences, keyword analysis finds statistically significant words compared to a reference corpus, and concordance analysis displays keywords in context to better understand usage. The document serves as an introduction to basic corpus analysis methods and tools.
The document discusses text mining, including defining it as the extraction of information from unstructured text using computational methods. It covers topics such as structured vs unstructured data, common text mining practice areas like information retrieval and document clustering, and challenges in text mining including ambiguity in language. Pre-processing techniques for text mining are also outlined, such as normalization, tokenization, stemming and removing stop words to clean and prepare text for analysis.
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel... - University of Bari (Italy)
The current abundance of electronic documents requires automatic techniques that support the users in understanding their content and extracting useful information. To this aim, improving the retrieval performance must necessarily go beyond simple lexical interpretation of the user queries, and pass through an understanding of their semantic content and aims. It goes without saying that any digital library would take enormous advantage from the availability of effective Information Retrieval techniques to provide to their users. This paper proposes an approach to Information Retrieval based on a correspondence of the domain of discourse between the query and the documents in the repository. Such an association is based on standard general-purpose linguistic resources (WordNet and WordNet Domains) and on a novel similarity assessment technique. Although the work is at a preliminary stage, interesting initial results suggest to go on extending and improving the approach.
Information retrieval chapter 2 - Text Operations.ppt - SamuelKetema1
This document discusses text operations for information retrieval systems, including tokenization, stemming, and removing stop words. It explains that tokenization breaks text into discrete tokens or words. Stemming reduces words to their root form by removing affixes like prefixes and suffixes. Stop words, which are very common words like "the" and "of", are filtered out since they provide little meaning. The goal of these text operations is to select more meaningful index terms to represent document contents for retrieval tasks.
1. TFIDF is a common technique for encoding documents as vectors based on the terms they contain. It weights terms based on their frequency in the given document (TF) and their overall frequency in the entire corpus (IDF).
2. To create a TFIDF vector, documents are first represented as bags of words based on a master list of terms. The vector entries correspond to terms, with the value being the TFIDF score.
3. Stopwords like "the" and "and" are excluded from the master term list since they provide little information about document content. Stemming reduces inflected words like "fights" and "fighting" to their stem "fight" to group related concepts.
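A minimal sketch of this encoding (assuming Python; the toy corpus, the tiny stop list, and the particular TF-IDF variant, raw term frequency times log(N/df), are illustrative choices of mine, and stemming is omitted for brevity):

```python
import math
from collections import Counter

STOP_WORDS = {"the", "and", "of", "a", "to"}  # toy stop list (illustrative)

def tfidf_vectors(docs):
    # Bag-of-words term counts per document, with stopwords excluded.
    bags = [Counter(w for w in d.lower().split() if w not in STOP_WORDS)
            for d in docs]
    # Master term list and document frequencies (df) over the corpus.
    terms = sorted({t for b in bags for t in b})
    df = {t: sum(1 for b in bags if t in b) for t in terms}
    n = len(docs)
    # TF-IDF entry: term frequency in this document * log(N / df) overall.
    return terms, [[b[t] * math.log(n / df[t]) for t in terms] for b in bags]

terms, vectors = tfidf_vectors(["the cat sat", "the cat fights", "dogs fight daily"])
print(terms)
print(vectors)
```

Terms that appear in every document get weight log(N/df) = 0, which is exactly the "frequent overall, hence non-discriminative" intuition above.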
The document discusses processing Boolean queries in an information retrieval system using an inverted index. It describes the steps to process a simple conjunctive query by locating terms in the dictionary, retrieving their postings lists, and intersecting the lists. More complex queries involving OR and NOT operators are also processed in a similar way. The document also discusses optimizing query processing by considering the order of accessing postings lists.
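A minimal sketch of such conjunctive query processing (assuming Python; the toy index and document IDs are mine, and the postings are visited shortest-first, the access-order optimization the summary alludes to):

```python
def intersect(p1, p2):
    # Merge-intersect two sorted postings lists of document IDs.
    out, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, terms):
    # Locate each term's postings list, then intersect shortest-first.
    postings = sorted((index.get(t, []) for t in terms), key=len)
    result = postings[0]
    for p in postings[1:]:
        result = intersect(result, p)
    return result

index = {"brutus": [1, 2, 4, 11], "caesar": [1, 2, 3, 5, 11], "calpurnia": [2, 11]}
print(and_query(index, ["brutus", "caesar", "calpurnia"]))  # [2, 11]
```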
Survey On Building A Database Driven Reverse Dictionary - Editor IJMTER
Reverse dictionaries are widely used as reference works organized by concepts, phrases, or the definitions of words. This paper describes the many challenges inherent in building a reverse lexicon and maps the problem to the well-known abstract similarity problem. Standard web search engines are basic versions of such a system; they take advantage of huge scale, which permits inferring general interest in documents from link information. The paper presents a study of a database-driven reverse dictionary using three large-scale datasets, namely person names, general English words and biomedical concepts, and analyzes difficulties arising in the use of documents produced by the reverse dictionary.
Segmentation Words for Speech Synthesis in Persian Language Based On Silence - paperpublications3
Abstract: In text-to-speech synthesis, words are usually broken into parts, and recorded sounds of each part are used to play back the words. This paper uses silence in a word's pronunciation to improve speech quality. Most algorithms divide words into syllables, and some divide words into phonemes; this paper instead exploits silent regions in intonation, divides words at those regions, and then assigns the equivalent sound of each part, so that joining the parts is reliable and the resulting speech is smoother. The paper concerns the Persian language but is extendable to other languages. The method has been tested with a MOS test, and intelligibility, naturalness and fluidity are improved.
Keywords: TTS, SBS, syllable, diphone.
Synonymy is an important yet intricate linguistic feature in the field of lexical semantics. Using the 100 million-word British National Corpus (BNC) as data and the software Sketch Engine (SkE) as an analyzing tool, this paper explores the collocational behavior and semantic prosodies of near synonyms in virtue of, owing to, thanks to, as a result of, due to and because of. The results show that these near synonyms differ in their collocational behavior and semantic prosodies. The pedagogical implications of the findings are also discussed.
The role of linguistic information for shallow language processing - Constantin Orasan
The document discusses shallow language processing and summarization. It argues that while deep language understanding is limited, shallow methods can be improved by adding linguistic information. As an example, it shows how term frequency, anaphora resolution, discourse cues and genetic algorithms can select extractive summaries that better match human abstracts, without requiring full text comprehension.
2. Statistical Properties of Text
How is the frequency of different words distributed? How fast does vocabulary size grow with the size of a corpus? Such factors affect the performance of an IR system and can be used to select suitable term weights and other aspects of the system.
A few words are very common: the 2 most frequent words (e.g. "the", "of") can account for about 10% of word occurrences in a document.
Most words are very rare: half the words in a corpus appear only once (the "hapax legomena", Greek for "read only once").
This is called a "heavy-tailed" distribution, since most of the probability mass is in the "tail".
4. Zipf's distributions
Rank-frequency distribution: for all the words in a document collection, for each word w
• f is the frequency with which w appears
• r is the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
[Figure: plot of sorted word frequencies, frequency f against rank r, according to Zipf's law]
5. Word distribution: Zipf's Law
Zipf's Law, named after the Harvard linguistics professor George Kingsley Zipf (1902-1950), attempts to capture the distribution of the frequencies (i.e., number of occurrences) of the words within a text.
Zipf's Law states that when the distinct words in a text are arranged in decreasing order of their frequency of occurrence (most frequent words first), the occurrence characteristics of the vocabulary can be characterized by the constant rank-frequency law of Zipf:
Frequency × Rank = constant
If the words w in a collection are ranked by their frequency f, with rank r, they roughly fit the relation: r × f = c.
Different collections have different constants c.
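A minimal sketch (assuming Python; the whitespace tokenization and the corpus.txt path are placeholder choices of mine) that ranks the words of a text by frequency and prints rank × frequency, which Zipf's law predicts should stay roughly constant:

```python
from collections import Counter

def zipf_table(text, top=10):
    # Naive lowercase/whitespace tokenization, then frequency counts.
    counts = Counter(text.lower().split())
    # Rank 1 = most frequent word; Zipf predicts rank * freq ~ constant.
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(f"{rank:4d}  {word:<15} {freq:8d}  r*f = {rank * freq}")

# Usage (corpus.txt is a placeholder path):
# zipf_table(open("corpus.txt", encoding="utf-8").read())
```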
6. Example: Zipf's Law
The table shows the most frequently occurring words from a 336,310-document collection containing 125,720,891 total words, out of which there are 508,209 unique words.
7. Zipf's law: modeling word distribution
The collection frequency of the i-th most common term is proportional to 1/i: if the most frequent term occurs f1 times, then the second most frequent term has half as many occurrences, the third most frequent term a third as many, etc.
More generally, Zipf's Law states that the frequency of the i-th most frequent word is 1/i^θ times that of the most frequent word, i.e. f_i ∝ 1/i^θ.
Equivalently, the probability of occurrence of some event P_i, as a function of the rank i (when the rank is determined by the frequency of occurrence), is a power-law function P_i ~ 1/i^θ, with the exponent θ close to unity.
8. More Example: Zipf's Law
Illustration of the rank-frequency law. Let the total number of word occurrences in the sample be N = 1,000,000.

Rank (R)  Term   Frequency (F)  R*(F/N)
 1        the       69,971       0.070
 2        of        36,411       0.073
 3        and       28,852       0.086
 4        to        26,149       0.104
 5        a         23,237       0.116
 6        in        21,341       0.128
 7        that      10,595       0.074
 8        is        10,099       0.081
 9        was        9,816       0.088
10        he         9,543       0.095
9. Methods that Build on Zipf's Law
• Stop lists: ignore the most frequent words (upper cut-off). Used by almost all systems.
• Significant words: take words in between the most frequent (upper cut-off) and least frequent words (lower cut-off); see the sketch after slide 13 below.
• Term weighting: give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.
10. Explanations for Zipf's Law
The law has been explained by the "principle of least effort," which makes it easier for a speaker or writer of a language to repeat certain words instead of coining new and different words. Zipf's own explanation was this principle of least effort: a balance between the speaker's desire for a small vocabulary and the hearer's desire for a large one.
Zipf's Law: impact on IR
Good news: stopwords account for a large fraction of text, so eliminating them greatly reduces inverted-index storage costs.
Bad news: for most words, gathering sufficient data for meaningful statistical analysis (e.g. correlation analysis for query expansion) is difficult, since they are extremely rare.
11. Word significance: Luhn's Ideas
Luhn's idea (1958): the frequency of word occurrence in a text furnishes a useful measurement of word significance. Luhn suggested that both extremely common and extremely uncommon words were not very useful for indexing.
12. Word significance: Luhn's Ideas
For this, Luhn specifies two cut-off points, an upper and a lower cut-off, based on which non-significant words are excluded:
The words exceeding the upper cut-off were considered to be common.
The words below the lower cut-off were considered to be rare.
Hence neither group contributes significantly to the content of the text.
The ability of words to discriminate content reaches a peak at a rank order position half way between the two cut-offs.
Let f be the frequency of occurrence of words in a text, and r their rank in decreasing order of word frequency; a plot relating f and r yields the following curve:
[Figure: rank-frequency curve with Luhn's upper and lower cut-offs; significant words lie between them]
13. Luhn's Ideas
Luhn (1958) suggested that both extremely common and extremely uncommon words were not very useful for document representation and indexing.
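A minimal sketch of Luhn-style significant-word selection, implementing the stop-list and significant-word ideas from slides 9 and 12 (the particular cut-off values are illustrative assumptions of mine, to be tuned per corpus):

```python
from collections import Counter

def significant_words(tokens, upper_cutoff=0.01, lower_count=2):
    """Keep words between Luhn's two cut-offs.

    upper_cutoff: drop words whose relative frequency exceeds this
                  (extremely common words, e.g. stopwords).
    lower_count:  drop words occurring fewer than this many times
                  (extremely rare words).
    Both thresholds are illustrative, not values from the slides.
    """
    counts = Counter(tokens)
    n = len(tokens)
    return {w for w, f in counts.items()
            if f >= lower_count and f / n <= upper_cutoff}
```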
14. Vocabulary size: Heaps' Law
Heaps' law estimates the number of vocabulary items (distinct words) in a given corpus.
Dictionaries contain 600,000 words and above, but they do not include names of people, locations, products, etc.
15. Vocabulary Growth: Heaps' Law
How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus? This determines how the size of the inverted index will scale with the size of the corpus.
Heaps' law estimates the number of vocabulary items in a given corpus: the vocabulary size grows as O(n^β), where β is a constant between 0 and 1. If V is the size of the vocabulary and n is the length of the corpus in words, Heaps' law provides the following equation:

    V = K * n^β

where the constants are typically 10 ≤ K ≤ 100 and 0.4 ≤ β ≤ 0.6 (approximately square-root growth).
16. Heaps' distributions
• Distribution of the size of the vocabulary: there is a linear relationship between vocabulary size and number of tokens (on a log-log scale).
• Example: from 1,000,000,000 words, there may be 1,000,000 distinct words. Can you agree?
17. Example
We want to estimate the size of the vocabulary for a corpus of 1,000,000 words, but we only know statistics computed on smaller corpora:
For 100,000 words there are 50,000 unique words.
For 500,000 words there are 150,000 unique words.
Estimate the vocabulary size for the 1,000,000-word corpus. How about for a corpus of 1,000,000,000 words?
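A minimal Python sketch of one way to solve this: fit K and β from the two known (n, V) pairs and extrapolate. The numbers follow only from the figures given in the exercise:

```python
import math

# Known measurements: (corpus size n, vocabulary size V)
n1, v1 = 100_000, 50_000
n2, v2 = 500_000, 150_000

# Heaps' law: V = K * n**beta. Dividing the two equations
# eliminates K, so beta = log(v2/v1) / log(n2/n1).
beta = math.log(v2 / v1) / math.log(n2 / n1)
K = v1 / n1**beta

def heaps(n: int) -> float:
    """Predicted vocabulary size for a corpus of n words."""
    return K * n**beta

print(f"beta = {beta:.3f}, K = {K:.1f}")
print(f"V(1,000,000)     ~ {heaps(1_000_000):,.0f}")
print(f"V(1,000,000,000) ~ {heaps(1_000_000_000):,.0f}")
```

With these two data points, β ≈ 0.68 and K ≈ 19, giving roughly 240,000 unique words at one million words and about 27 million at one billion.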
19. Issues: recall and precision
Breaking up hyphenated terms increases recall but decreases precision.
Preserving case distinctions enhances precision but decreases recall.
Commercial information systems usually take a recall-enhancing approach (numbers and words containing digits become index terms, and everything is case-insensitive).
20. Text Operations
Not all words in a document are equally significant in representing its contents/meaning:
Some words carry more meaning than others.
Nouns tend to be the most representative of a document's content.
Therefore, the text of the documents in a collection needs to be preprocessed before being used for index terms:
Using the set of all words in a collection to index documents creates too much noise for the retrieval task.
Reducing noise means reducing the set of distinct words that can be used to refer to a document.
Preprocessing is the process of controlling the size of the vocabulary, i.e. the number of distinct words used as index terms.
Preprocessing generally leads to an improvement in retrieval performance.
However, some search engines on the Web omit preprocessing: every word in the document is an index term.
21. Text Operations
Text operations are the process of transforming text into its logical representation.
The main operations for selecting index terms, i.e. choosing the words/stems (or groups of words) to be used as indexing terms, are:
Lexical analysis/Tokenization of the text - digits, hyphens, punctuations
marks, and the case of letters
Elimination of stop words - filter out words which are not useful in the
retrieval process
Stemming words - remove affixes (prefixes and suffixes)
Construction of term categorization structures such as thesaurus, to
capture relationship for allowing the expansion of the original query
with related terms
22. Generating Document Representatives
Text Processing System
Input text – full text, abstract or title
Output – a document representative adequate for use in an automatic retrieval
system
The document representative consists of a list of class names, each name
representing a class of words occurring in the total input text. A document
will be indexed by a name if one of its significant words occurs as a member of that
class.
[Pipeline: documents → tokenization → stop-word removal → stemming → thesaurus → index terms]
23. Lexical Analysis/Tokenization of Text
Change text of the documents into words to be adopted as
index terms
Objective - identify words in the text
Digits, hyphens, punctuation marks, case of letters:
Numbers are usually not good index terms (like 1910, 1999), but some, such as "510 B.C.", are unique and worth keeping.
Hyphens: break up hyphenated words (e.g. state-of-the-art = state of the art), but some words (e.g. gilt-edged, B-49) are unique and require their hyphens.
Punctuation marks: remove entirely unless significant, e.g. in program code: x.exe versus xexe.
Case of letters: usually not important; all letters can be converted to upper or lower case.
24. Tokenization
Analyze text into a sequence of discrete tokens (words).
Input: "Friends, Romans and Countrymen"
Output: tokens (each an instance of a sequence of characters grouped together as a useful semantic unit for processing)
Friends
Romans
and
Countrymen
Each such token is now a candidate for an index entry, after
further processing
But what are valid tokens to emit?
25. Issues in Tokenization
One word or multiple: how do you decide whether something is one token or two or more?
Hewlett-Packard → Hewlett and Packard as two tokens?
state-of-the-art: break up the hyphenated sequence?
San Francisco, Los Angeles: one token or two?
lowercase, lower-case, lower case?
data base, database, data-base?
• Numbers:
dates (3/12/91 vs. Mar. 12, 1991);
phone numbers;
IP addresses (100.2.86.144)
26. Issues in Tokenization
How do we handle special cases involving apostrophes, hyphens, etc.? C++, C#, URLs, emails, …
Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token; however, frequently they are not.
The simplest approach is to ignore all numbers and punctuation and use only case-insensitive, unbroken strings of alphabetic characters as tokens.
Generally, numbers are not indexed as text, although they are often very useful; systems will often index "meta-data" (creation date, format, etc.) separately.
Issues of tokenization are language-specific and require the language to be known.
The application area for which the IR system is developed also dictates the nature of valid tokens.
27. Exercise: Tokenization
The cat slept peacefully in the living room. It’s a very old
cat.
Mr. O’Neill thinks that the boys’ stories about Chile’s capital
aren’t amusing.
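As a hint at how much the rules matter, here is a minimal Python sketch contrasting a naive alphabetic tokenizer with one that keeps internal apostrophes. The regexes are illustrative choices, not the only correct ones:

```python
import re

text = ("The cat slept peacefully in the living room. It's a very old cat. "
        "Mr. O'Neill thinks that the boys' stories about Chile's capital "
        "aren't amusing.")

# A naive tokenizer: lowercase, keep only alphabetic runs ("it's" -> "it", "s").
naive = re.findall(r"[a-z]+", text.lower())

# A slightly smarter one that keeps internal apostrophes,
# so "o'neill" and "aren't" survive as single tokens.
with_apostrophes = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

print(naive[:12])
print(with_apostrophes[:12])
```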
28. Elimination of STOPWORDS
Stopwords are extremely common words across document collections that have no discriminatory power:
They may occur in 80% of the documents in a collection.
They appear to be of little value in helping select documents matching a user need, and need to be filtered out of the index list.
Examples of stopwords:
articles (a, an, the);
pronouns: (I, he, she, it, their, his)
prepositions (on, of, in, about, besides, against),
conjunctions/ connectors (and, but, for, nor, or, so, yet),
verbs (is, are, was, were),
adverbs (here, there, out, because, soon, after) and
adjectives (all, any, each, every, few, many, some)
Stopwords are language dependent.
29. Stopwords
Intuition:
Stopwords have little semantic content; it is typical to remove such high-frequency words.
Stopwords can account for up to about 50% of the text, so removing them reduces document (and index) size by 30–50%.
Smaller indices for information retrieval.
Good compression techniques for indices: the 30 most common words account for about 30% of the tokens in written text.
Better approximation of importance for classification, summarization, etc.
30. How to determine a list of stopwords?
One method: sort terms in decreasing order of collection frequency and take the most frequent ones.
In a collection about insurance practices, "insurance" would be a stopword.
Another method: build a stopword list that contains a set of articles, pronouns, etc.
Why do we need stop lists? With a stop list, we can exclude the commonest words from the index terms entirely, and with stopwords removed we get a better approximation of importance for classification, summarization, etc.
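A minimal sketch of the first method over a hypothetical toy collection (documents and counts invented for illustration):

```python
from collections import Counter

# Toy collection; in practice this would be the whole corpus.
docs = [
    "insurance claims are processed by the insurance agent",
    "the agent reviews the claim and the policy",
    "policy holders file claims with the insurance company",
]

# Collection frequency: total occurrences of each term across all documents.
freq = Counter(token for d in docs for token in d.split())

# Take the k most frequent terms as stopwords.
k = 3
stoplist = {term for term, _ in freq.most_common(k)}
print(stoplist)  # in this toy insurance corpus, "the" and "insurance" rank high
```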
31. Stop words
Stopword elimination used to be standard in older IR systems, but the trend is away from doing this. Most web search engines index stopwords:
Good query optimization techniques mean you pay little at query time for including stopwords.
You need stopwords for:
Phrase queries: "King of Denmark"
Various song titles, etc.: "Let it be", "To be or not to be"
"Relational" queries: "flights to London"
Elimination of stopwords can reduce recall (e.g. for "To be or not to be", everything but "be" would be eliminated, leading to no retrieval or irrelevant retrieval).
32. Normalization
• Normalization is canonicalizing tokens so that matches occur despite superficial differences in their character sequences.
We need to "normalize" terms in the indexed text, as well as query terms, into the same form.
Example: we want to match U.S.A. and USA, by deleting periods in a term.
Case folding: it is often best to lowercase everything, since users will use lowercase regardless of "correct" capitalization:
Republican vs. republican
Fasil vs. fasil vs. FASIL
Anti-discriminatory vs. antidiscriminatory
Car vs. automobile?
33. Normalization issues
Good for:
Allowing instances of Automobile at the beginning of a sentence to match a query for automobile.
Helping a search engine when most users type ferrari but are interested in a Ferrari car.
Bad for:
Proper names vs. common nouns, e.g. General Motors, Associated Press, …
Solution:
Lowercase only words at the beginning of a sentence.
In IR, lowercasing everything is most practical because of the way users issue their queries.
34. Term conflation
One of the problems involved in the use of free text for
indexing and retrieval is the variation in word forms that is
likely to be encountered.
The most common types of variation are:
spelling errors (father, fathor)
alternative spellings, i.e. locality or national usage (color vs. colour, labor vs. labour)
multi-word concepts (database, data base)
affixes (dependent, independent, dependently)
abbreviations (i.e. for "that is").
35. Functions of conflation in terms of IR
Reducing the total number of distinct terms, with a consequent reduction in dictionary size and in update effort:
Fewer terms to worry about.
Bringing similar words with similar meanings to a common form, with the aim of increasing retrieval effectiveness:
More accurate statistics.
36. Stemming/Morphological analysis
Stemming reduces tokens to their "root" form to recognize morphological variation.
The process involves the removal of affixes (i.e. prefixes and suffixes), with the aim of reducing variants to the same stem.
It often removes the inflectional and derivational morphology of a word:
Inflectional morphology varies the form of a word to express grammatical features such as singular/plural or past/present tense, e.g. boy → boys, cut → cutting.
Derivational morphology makes new words from old ones, e.g. creation is formed from create, but they are two separate words; likewise destruction → destroy.
Stemming is language-dependent:
Correct stemming is language-specific and can be complex; in English, for example, compressed and compression are both accepted as equivalent to compress.
37. Stemming
The final output of a conflation algorithm is a set of classes, one for each stem detected.
A stem is the portion of a word left after the removal of its affixes (i.e., prefixes and/or suffixes).
Example: 'connect' is the stem of {connected, connecting, connection, connections}.
Thus, [automate, automatic, automation] all reduce to automat.
A class name is assigned to a document if and only if one of its members occurs as a significant word in the text of the document.
A document representative then becomes a list of class names, often referred to as the document's index terms/keywords.
Queries are handled in the same way.
38. Ways to implement stemming
There are basically two ways to implement stemming.
The first approach is to create a big dictionary that maps words to
their stems.
The advantage of this approach is that it works perfectly (insofar as the
stem of a word can be defined perfectly); the disadvantages are the space
required by the dictionary and the investment required to maintain the
dictionary as new words appear.
The second approach is to use a set of rules that extract stems
from words.
The advantages of this approach are that the code is typically small, and it
can gracefully handle new words; the disadvantage is that it occasionally
makes mistakes.
But since stemming is imperfectly defined anyway, occasional mistakes are tolerable, and the rule-based approach is the one that is generally chosen.
39. Assumptions in Stemming
Words with the same stem are semantically related and have the same meaning to the user of the text.
The chance of matching increases when index terms are reduced to their word stems, because it is more natural to search using "retrieve" than "retrieving".
40. Stemming Errors
Under-stemming: the error of taking off too small a suffix
croulons → croulon (croulons is a form of the verb crouler)
Over-stemming: the error of taking off too much
croûtons → croût (croûtons is the plural of croûton)
Mis-stemming: taking off what looks like an ending but is really part of the stem
reply → rep
41. Criteria for judging stemmers
Correctness:
Over-stemming: too much of a term is removed.
Can cause unrelated terms to be conflated → retrieval of non-relevant documents.
Under-stemming: too little of a term is removed.
Prevents related terms from being conflated → relevant documents may not be retrieved.
Term      Stem
GASES     GAS   (correct)
GAS       GA    (over-stemming)
GASEOUS   GASE  (under-stemming)
42. Porter Stemmer
Stemming is the operation of stripping the suffixes from a word, leaving its stem.
Google, for instance, uses stemming to search for web pages containing the words connected, connecting, connection and connections when users ask for a web page that contains the word connect.
In 1979, Martin Porter developed a stemming algorithm that uses a set of rules to extract stems from words, and though it makes some mistakes, most common words seem to work out right.
Porter describes his algorithm and provides a reference implementation in C at http://tartarus.org/~martin/PorterStemmer/index.html
43. Porter stemmer
The most common algorithm for stemming English words to their common grammatical root.
It is a simple procedure for removing known affixes in English without using a dictionary. To get rid of plurals, the following rules are used:
SSES → SS   caresses → caress
IES  → Y    ponies → pony
SS   → SS   caress → caress
S    → ∅    cats → cat
MENT → ∅ (delete the final "ment" only if what remains is long enough)
            replacement → replace
            cement → cement
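The following Python sketch mirrors the simplified rules above; it is not the full Porter algorithm, and the length test is a crude stand-in for Porter's measure-based condition:

```python
def strip_plural(word: str) -> str:
    """A minimal sketch of the plural rules above (not the full Porter stemmer)."""
    if word.endswith("sses"):
        return word[:-2]           # caresses -> caress
    if word.endswith("ies"):
        return word[:-3] + "y"     # ponies -> pony
    if word.endswith("ss"):
        return word                # caress -> caress
    if word.endswith("s"):
        return word[:-1]           # cats -> cat
    return word

def strip_ment(word: str) -> str:
    """Delete a final 'ment' only if enough of the word remains."""
    # Crude stand-in for Porter's measure-based condition.
    if word.endswith("ment") and len(word) - 4 > 2:
        return word[:-4]           # replacement -> replace
    return word                    # cement stays cement

for w in ["caresses", "ponies", "caress", "cats", "replacement", "cement"]:
    print(w, "->", strip_ment(strip_plural(w)))
```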
44. Stemming: challenges
May produce unusual stems that are not English words: removing 'UAL' from FACTUAL and EQUAL.
May conflate (reduce to the same token) words that are actually distinct: "computer", "computational", and "computation" are all reduced to the same token, "comput".
May not recognize all morphological derivations.
45. Thesauri
Full-text searching alone is often not accurate, since different authors may select different words to represent the same concept:
Problem: the same meaning can be expressed using different terms that are synonyms, homonyms, or related terms.
How can we ensure that, for the same meaning, identical terms are used in the index and in the query?
Thesaurus: the vocabulary of a controlled indexing language, formally organized so that a priori relationships between concepts (for example "broader" and "related") are made explicit.
A thesaurus contains terms and relationships between terms.
IR thesauri typically rely on symbols such as USE/UF (UF = used for), BT (broader term), and RT (related term) to express inter-term relationships.
e.g. car = automobile, truck, bus, taxi, motor vehicle; color = colour, paint
46. Aim of Thesaurus
A thesaurus tries to control the use of the vocabulary by showing a set of related words, to handle synonyms and homonyms.
The aims of a thesaurus are therefore:
to provide a standard vocabulary for indexing and searching:
Thesaurus rewriting forms equivalence classes, and we index those classes: when a document contains automobile, index it under car as well (usually also vice versa);
to assist users with locating terms for proper query formulation:
when the query contains automobile, look under car as well when expanding the query;
to provide classified hierarchies that allow the broadening and narrowing of the current request according to user needs.
47. Thesaurus Construction
Example: a thesaurus built to assist IR in searching for cars and vehicles:
Term: Motor vehicles
  UF:  Automobiles
       Cars
       Trucks
  BT:  Vehicles
  RT:  Road Engineering
       Road Transport
48. More Example
Example: a thesaurus built to assist IR in the field of computer science:
TERM: natural languages
  UF  natural language processing (UF = used for NLP)
  BT  languages (BT = broader term is languages)
  TT  languages (TT = top term is languages)
  RT  artificial intelligence (RT = related term/s)
      computational linguistics
      formal languages
      query languages
      speech recognition
49. Language-specificity
Many of the above features embody transformations that are
Language-specific and
Often, application-specific
These are “plug-in” addenda to the indexing process
Both open source and commercial plug-ins are available for
handling these
50. INDEX TERM SELECTION
The index language is the language used to describe documents and requests.
Elements of the index language are index terms, which may be derived from the text of the document to be described or may be arrived at independently.
If a full-text representation of the text is adopted, then all words in the text are used as index terms (full-text indexing).
Otherwise, we need to select the words to be used as index terms, to reduce the size of the index file, which is basic to designing an efficient IR system.
52. Terms
Terms are usually stems. Terms can also be phrases, such as "Computer Science" or "World Wide Web".
Documents as well as queries are represented as vectors or "bags of words" (BOW).
Each vector holds a place for every term in the collection: position 1 corresponds to term 1, position 2 to term 2, ..., position n to term n; w_ij = 0 if a term is absent.
Documents are represented by binary or non-binary (weighted) vectors of terms.
D_j = (w_1j, w_2j, ..., w_nj)
Q   = (w_1q, w_2q, ..., w_nq)
53. Document Collection
A collection of n documents can be represented in the vector
space model by a term-document matrix.
An entry in the matrix corresponds to the “weight” of a term
in the document; zero means the term has no significance in
the document or it simply doesn’t exist in the document.
     T1   T2   ...  Tt
D1   w11  w21  ...  wt1
D2   w12  w22  ...  wt2
:    :    :         :
Dn   w1n  w2n  ...  wtn
54. Binary Weights
• Only the presence (1) or absence
(0) of a term is included in the
vector
• Binary formula gives every word
that appears in a document equal
relevance.
• It can be useful when frequency
is not important.
• Binary Weights Formula:
docs t1 t2 t3
D1 1 0 1
D2 1 0 0
D3 0 1 1
D4 1 0 0
D5 1 1 1
D6 1 1 0
D7 0 1 0
D8 0 1 0
D9 0 0 1
D10 0 1 1
D11 1 0 1
w_ij = 1 if freq_ij > 0
w_ij = 0 if freq_ij = 0
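A minimal Python sketch of binary weighting, using the first rows of the table above (the raw counts are hypothetical):

```python
# Binary document vectors over a fixed vocabulary.
vocab = ["t1", "t2", "t3"]

# Raw term frequencies per document (toy counts for illustration).
docs = {
    "D1": {"t1": 2, "t3": 3},
    "D2": {"t1": 1},
    "D3": {"t2": 4, "t3": 7},
}

def binary_vector(freqs: dict) -> list:
    # w_ij = 1 if freq_ij > 0, else 0
    return [1 if freqs.get(t, 0) > 0 else 0 for t in vocab]

for name, freqs in docs.items():
    print(name, binary_vector(freqs))
# D1 [1, 0, 1]   D2 [1, 0, 0]   D3 [0, 1, 1]
```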
55. Why use term weighting?
Binary weights are too limiting:
Terms are either present or absent.
This does NOT allow ordering documents according to their level of relevance for a given query.
Non-binary weights allow the modeling of partial matching:
Partial matching allows retrieval of documents that approximate the query.
Term weighting improves the quality of the answer set:
It enables ranking of the retrieved documents, so that the best-matching documents are ordered at the top, as they are more relevant than the others.
56. Term Weighting: Term Frequency (TF)
TF (term frequency): count the number of times a term occurs in a document.
f_ij = frequency of term i in document j
The more times a term t occurs in document d, the more likely it is that t is relevant to the document, i.e. more indicative of its topic.
Used alone, TF favors common words and long documents, and gives too much credit to words that appear frequently.
We may therefore want to normalize term frequency across the document:
docs t1 t2 t3
D1 2 0 3
D2 1 0 0
D3 0 4 7
D4 3 0 0
D5 1 6 3
D6 3 5 0
D7 0 8 0
D8 0 10 0
D9 0 0 1
D10 0 3 5
D11 4 0 1
Q 1 0 1
tf_ij = freq_ij / max_l(freq_lj)
(the raw frequency of term i in document j, normalized by the frequency of the most frequent term in document j)
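A small Python sketch of max-normalized TF, applied to a few rows of the table above:

```python
# Max-normalized term frequency: tf_ij = freq_ij / max_l(freq_lj).
raw = {
    "D1": [2, 0, 3],
    "D3": [0, 4, 7],
    "D5": [1, 6, 3],
}

def normalize(freqs: list) -> list:
    peak = max(freqs)  # frequency of the most frequent term in the document
    return [round(f / peak, 2) for f in freqs]

for doc, freqs in raw.items():
    print(doc, normalize(freqs))
# D1 [0.67, 0.0, 1.0]   D3 [0.0, 0.57, 1.0]   D5 [0.17, 1.0, 0.5]
```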
57. Document Normalization
Long documents have an unfair advantage:
They use a lot of terms
So they get more matches than short documents
And they use the same words repeatedly
So they have large term frequencies
Normalization seeks to remove these effects:
Related somehow to maximum term frequency.
But also sensitive to the number of terms.
What would happen if term frequencies in a document were not normalized?
Short documents might not be recognized as relevant.
58. Problems with term frequency
We need a mechanism to reduce the effect of terms that occur too often in the collection to be meaningful for relevance determination:
Scale down the term weight of terms with a high collection frequency;
reduce the tf weight of a term by a factor that grows with its collection frequency.
More common for this purpose is document frequency: how many documents in the collection contain the term.
• Collection frequency (cf) and document frequency (df) can behave quite differently for the same term.
59. Document Frequency
It is defined to be the number of documents in the collection
that contain a term
DF = document frequency:
Count the frequency over the whole collection of documents.
The less frequently a term appears in the whole collection, the more discriminating it is.
df_i = document frequency of term i = number of documents containing term i
60. Inverse Document Frequency (IDF)
IDF measures how rare a term is in the collection; it is a measure of the general importance of the term and inverts the document frequency.
It diminishes the weight of terms that occur very frequently in
the collection and increases the weight of terms that occur
rarely.
Gives full weight to terms that occur in one document only.
Gives lowest weight to terms that occur in all documents. (log2(1)=0)
Terms that appear in many different documents are less indicative of
overall topic.
idfi = inverse document frequency of term i,
(N: total number of documents)
idf_i = log2(N / df_i)
61. Inverse Document Frequency
• IDF provides high values for rare words and low values for common words.
• IDF is an indication of a term's discrimination power.
• The log is used to dampen the effect relative to tf.
• Note the difference between document frequency and collection (corpus) frequency.
• E.g., given a collection of 1000 documents, compute the IDF for each word from its document frequency:
Word    N     DF    IDF
the     1000  1000  log2(1000/1000) = log2(1)    = 0
some    1000  100   log2(1000/100)  = log2(10)   = 3.322
car     1000  10    log2(1000/10)   = log2(100)  = 6.644
merge   1000  1     log2(1000/1)    = log2(1000) = 9.966
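The table can be reproduced with a few lines of Python:

```python
import math

N = 1000  # total documents in the collection
df = {"the": 1000, "some": 100, "car": 10, "merge": 1}

for word, d in df.items():
    idf = math.log2(N / d)  # idf_i = log2(N / df_i)
    print(f"{word:6s} df={d:5d} idf={idf:.3f}")
# the 0.000, some 3.322, car 6.644, merge 9.966
```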
62. TF*IDF Weighting
The most-used term weighting is the tf*idf scheme:
w_ij = tf_ij × idf_i = tf_ij × log2(N / df_i)
A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
The tf-idf value for a term is always greater than or equal to zero.
Experimentally, tf*idf has been found to work well.
It is often used in the vector space model together with cosine similarity to determine the similarity between two documents.
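A minimal end-to-end sketch over a hypothetical toy corpus (documents invented for illustration), combining max-normalized TF with IDF as defined above:

```python
import math
from collections import Counter

docs = {
    "d1": "the brown cow jumped",
    "d2": "the cow ate grass",
    "d3": "brown grass everywhere",
}

N = len(docs)
tokenized = {d: text.split() for d, text in docs.items()}
# Document frequency: number of documents containing each term.
df = Counter(term for toks in tokenized.values() for term in set(toks))

def tfidf(doc: str) -> dict:
    counts = Counter(tokenized[doc])
    peak = max(counts.values())
    # w_ij = (freq_ij / max freq in doc) * log2(N / df_i)
    return {t: (c / peak) * math.log2(N / df[t]) for t, c in counts.items()}

print(tfidf("d1"))  # "jumped" (rare) scores higher than "the" or "cow"
```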
63. TF*IDF weighting
When does TF*IDF register a high weight? When a term t occurs many times within a small number of documents.
The highest tf*idf values therefore indicate a term with a high term frequency (in the given document) and a low document frequency (in the whole collection); such weights tend to filter out common terms, giving high discriminating power to the documents that contain the rare ones.
A lower TF*IDF is registered when the term occurs fewer times in a document or occurs in many documents, offering a less pronounced relevance signal.
The lowest TF*IDF is registered when the term occurs in virtually all documents.
64. Computing TF-IDF: An Example
Assume the collection contains 10,000 documents, and statistical analysis shows that the document frequencies (DF) of three terms are A(50), B(1300), C(250), while their term frequencies (TF) are A(3), B(2), C(1). Compute TF*IDF for each term.
A: tf = 3/3 = 1.00; idf = log2(10000/50) = 7.644; tf*idf = 7.644
B: tf = 2/3 = 0.67; idf = log2(10000/1300) = 2.943; tf*idf = 1.962
C: tf = 1/3 = 0.33; idf = log2(10000/250) = 5.322; tf*idf = 1.774
The query vector is typically treated as a document and also tf-idf weighted.
65. More Example
Consider a document containing 100 words in which the word cow appears 3 times. Now assume we have 10 million documents and cow appears in one thousand of them.
The term frequency (TF) for cow: 3/100 = 0.03
The inverse document frequency: log2(10,000,000 / 1,000) = log2(10,000) ≈ 13.288
The TF*IDF score is the product of these frequencies: 0.03 × 13.288 ≈ 0.399
66. Exercise
• Let C = number of times a given word appears in a document;
• TW = total number of words in a document;
• TD = total number of documents in the corpus; and
• DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term below:
Word C TW TD DF TF IDF TFIDF
airplane 5 46 3 1
blue 1 46 3 1
chair 7 46 3 3
computer 3 46 3 1
forest 2 46 3 1
justice 7 46 3 3
love 2 46 3 1
might 2 46 3 1
perl 5 46 3 2
rose 6 46 3 3
shoe 4 46 3 1
thesis 2 46 3 2
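One way to fill in the table, assuming TF = C/TW and IDF = log2(TD/DF) as in the preceding slides:

```python
import math

TW, TD = 46, 3  # total words in the document, total documents in the corpus
# Per-word values from the exercise table: (C, DF).
table = {
    "airplane": (5, 1), "blue": (1, 1), "chair": (7, 3), "computer": (3, 1),
    "forest": (2, 1), "justice": (7, 3), "love": (2, 1), "might": (2, 1),
    "perl": (5, 2), "rose": (6, 3), "shoe": (4, 1), "thesis": (2, 2),
}

for word, (c, df) in table.items():
    tf = c / TW               # length-normalized term frequency
    idf = math.log2(TD / df)  # inverse document frequency
    print(f"{word:9s} TF={tf:.3f} IDF={idf:.3f} TF*IDF={tf * idf:.3f}")
```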
67. Concluding remarks
Suppose that from a set of English documents we wish to determine which ones are the most relevant to the query "the brown cow."
A simple way to start is by eliminating documents that do not contain all three words "the," "brown," and "cow," but this still leaves many documents.
To further distinguish them, we might count the number of times each term occurs in each document and sum the counts; the number of times a term occurs in a document is called its TF. However, because the term "the" is so common, this tends to incorrectly emphasize documents which happen to use the word "the" more often, without giving enough weight to the more meaningful terms "brown" and "cow."
The term "the" is not a good keyword for distinguishing relevant from non-relevant documents, while rarer terms like "brown" and "cow" are good keywords for doing so.
68. Concluding remarks
Hence IDF is incorporated, which diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely.
This leads to the use of TF*IDF as a better weighting technique.
On top of that, we apply similarity measures to calculate the distance between document i and query j.
• There are a number of similarity measures; the most common are Euclidean distance, the inner (dot) product, cosine similarity, Dice similarity, Jaccard similarity, etc.
69. Similarity Measure
We now have vectors for all documents in the collection and a vector for the query; how do we compute similarity?
A similarity measure is a function that computes the degree of similarity or distance between a document vector and a query vector.
Using a similarity measure between the query and each document:
It is possible to rank the retrieved documents in order of presumed relevance.
It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled (minimized).
[Figure: document vectors D1 and D2 and query vector Q plotted in a three-term space (t1, t2, t3), with angles 1 and 2 between Q and each document]
70. Intuition
Postulate: documents that are "close together" in the vector space talk about the same things and are more similar than others.
[Figure: document vectors d1–d5 plotted in a space of terms t1, t2, t3; the angles θ and φ between vectors indicate their similarity]
71. Similarity Measure
Assumptions for proximity:
1. If d1 is near d2, then d2 is near d1.
2. If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
3. No document is closer to d than d itself.
Sometimes it is a good idea to determine the maximum possible similarity as the "distance" between a document d and itself.
A similarity measure attempts to compute the distance between document vector w_j and query vector w_q.
The assumption here is that documents whose vectors are close to the query vector are more relevant to the query than documents whose vectors are farther away.
72. Similarity Measure: Techniques
• Euclidean distance
The most common distance measure. It examines the root of the squared differences between the coordinates of a pair of document and query vectors.
• Dot product
Also known as the scalar or inner product; it is the sum of the products of the corresponding document and query term weights.
• Cosine similarity
Also called the normalized inner product; it projects document and query vectors into the term space and calculates the cosine of the angle between them.
73. Euclidean distance
Similarity (distance) between the vectors for document d_j and query q can be computed as:
distance(d_j, q) = |d_j − q| = sqrt( Σ_{i=1..n} (w_ij − w_iq)² )
where w_ij is the weight of term i in document j and w_iq is the weight of term i in the query.
• Example: determine the Euclidean distance between the document vector (0, 3, 2, 1, 10) and the query vector (2, 7, 1, 0, 0), where 0 means the corresponding term does not occur in the document or query:
sqrt((0−2)² + (3−7)² + (2−1)² + (1−0)² + (10−0)²) = sqrt(122) ≈ 11.05
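The example can be checked in a couple of lines of Python:

```python
import math

doc = (0, 3, 2, 1, 10)
query = (2, 7, 1, 0, 0)

# Euclidean distance: square root of the sum of squared coordinate differences.
dist = math.sqrt(sum((d - q) ** 2 for d, q in zip(doc, query)))
print(round(dist, 2))  # 11.05
```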
74. Inner Product
Similarity between the vectors for document d_j and query q can be computed as the vector inner product:
sim(d_j, q) = d_j · q = Σ_{i=1..n} w_ij × w_iq
where w_ij is the weight of term i in document j and w_iq is the weight of term i in the query.
For binary vectors, the inner product is the number of matched query terms in the document (the size of the intersection).
For weighted term vectors, it is the sum of the products of the weights of the matched terms.
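The same vectors as in the Euclidean example, under the inner product:

```python
doc = (0, 3, 2, 1, 10)
query = (2, 7, 1, 0, 0)

# Inner (dot) product: sum of products of matching term weights.
sim = sum(d * q for d, q in zip(doc, query))
print(sim)  # 0*2 + 3*7 + 2*1 + 1*0 + 10*0 = 23
```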
75. Properties of Inner Product
Favors long documents with a large number of
unique terms.
Why?....
Again, the issue of normalization needs to be
considered
Measures how many terms matched but not how
many terms are not matched.
The unmatched terms will have value zero as either
the document vector or the query vector will have a
weight of ZERO
79. Cosine similarity
Measures the similarity between d_j and q captured by the cosine of the angle θ between them:
sim(d_j, q) = (d_j · q) / (|d_j| × |q|)
            = Σ_{i=1..n} w_ij × w_iq / ( sqrt(Σ_{i=1..n} w_ij²) × sqrt(Σ_{i=1..n} w_iq²) )
where the length of a document vector is |d_j| = sqrt( Σ_{i=1..n} w_ij² ).
The same measure applies between two documents d_j and d_k:
sim(d_j, d_k) = (d_j · d_k) / (|d_j| × |d_k|)
The denominator involves the lengths of the vectors, which is why the cosine measure is also known as the normalized inner product.
81. Example: Computing Cosine Similarity
• Say we have two documents in our corpus, D1 = (0.8, 0.3) and D2 = (0.2, 0.7), and the query vector Q = (0.4, 0.8). Determine which document is most relevant for the query.
cos θ1 (between Q and D1) ≈ 0.74; cos θ2 (between Q and D2) ≈ 0.98, so D2 is more relevant to Q.
[Figure: D1, D2 and Q plotted in a two-term space, with angles θ1 and θ2 between Q and each document]
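Checking the example numerically (small rounding differences from the slide's 0.74 are expected):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine(Q, D1), 2))  # 0.73, ~0.74 in the slide
print(round(cosine(Q, D2), 2))  # 0.98 -> D2 is the better match
```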
82. Example
Given three documents D1, D2, and D3 with the corresponding TF*IDF weights below, which documents are most similar under the three measures?
Terms      D1  D2  D3
affection   2   3   1
jealous     0   2   4
gossip      1   0   2
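A sketch that computes all three measures for each document pair, so the rankings can be compared:

```python
import math

# Column vectors from the table: (affection, jealous, gossip).
D1, D2, D3 = (2, 0, 1), (3, 2, 0), (1, 4, 2)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

pairs = {"D1-D2": (D1, D2), "D1-D3": (D1, D3), "D2-D3": (D2, D3)}
for name, (a, b) in pairs.items():
    print(name, round(euclid(a, b), 2), dot(a, b), round(cosine(a, b), 2))
```

Note that the three measures need not agree on which pair is "most similar"; that disagreement is the point of the exercise.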
83. Cosine Similarity vs. Inner Product
Cosine similarity measures the cosine of the angle between two vectors: the inner product normalized by the vector lengths.
CosSim(d_j, q) = (d_j · q) / (|d_j| × |q|) = Σ_{i=1..n} w_ij × w_iq / sqrt( Σ_{i=1..n} w_ij² × Σ_{i=1..n} w_iq² )
InnerProduct(d_j, q) = d_j · q
D1 = 2T1 + 3T2 + 5T3   CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3   CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) = 0.13
Q  = 0T1 + 0T2 + 2T3
D1 is about 6 times better than D2 using cosine similarity but only 5 times better using the inner product.
84. Exercises
A database collection consists of 1 million documents, of which 200,000 contain the term holiday while 250,000 contain the term season. A document repeats holiday 7 times and season 5 times. It is known that holiday is repeated more than any other term in the document. Calculate the weight of both terms in this document using different term-weighting methods. Try with:
(i) normalized and unnormalized TF;
(ii) TF*IDF based on normalized and unnormalized TF.
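One way to work the exercise in Python, assuming max-normalization (holiday is the most frequent term, so normalized tf = freq/7) and IDF = log2(N/DF), following the definitions in the preceding slides:

```python
import math

N = 1_000_000                 # documents in the collection
max_freq = 7                  # holiday is the most frequent term in the document
terms = {"holiday": (7, 200_000), "season": (5, 250_000)}  # (freq, df)

for term, (freq, df) in terms.items():
    tf_raw = freq
    tf_norm = freq / max_freq
    idf = math.log2(N / df)
    print(f"{term}: tf={tf_raw}, tf_norm={tf_norm:.2f}, "
          f"tf*idf={tf_raw * idf:.2f}, tf_norm*idf={tf_norm * idf:.2f}")
# holiday: idf = log2(5) ~ 2.32; season: idf = log2(4) = 2.00
```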