The document discusses processing Boolean queries in an information retrieval system using an inverted index. It describes the steps to process a simple conjunctive query by locating terms in the dictionary, retrieving their postings lists, and intersecting the lists. More complex queries involving OR and NOT operators are also processed in a similar way. The document also discusses optimizing query processing by considering the order of accessing postings lists.
2. Boolean Retrieval Model
Processing Boolean queries
To process a simple conjunctive query such as “Brutus AND Calpurnia” using an inverted index and the basic Boolean retrieval model, we follow these steps:
1. Locate Brutus in the Dictionary
2. Retrieve its postings
3. Locate Calpurnia in the Dictionary
4. Retrieve its postings
5. Intersect the two postings lists
3. Boolean Retrieval Model
Processing Boolean queries
The intersection operation is the crucial one: we need to efficiently intersect postings lists so as to be able to quickly find documents that contain both terms.
This operation is sometimes referred to as merging postings lists.
4. Boolean Retrieval Model
Processing Boolean queries
If the lengths of the postings lists are x and y, the intersection takes O(x + y) operations.
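The linear merge behind this O(x + y) bound can be sketched as a two-pointer walk over the sorted lists. A minimal Python sketch (the docIDs below are illustrative, not from a real collection):

```python
def intersect(p1, p2):
    """Linear-merge intersection of two sorted postings lists: O(x + y)."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID appears in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer on the smaller docID
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Because both lists are sorted by docID, each pointer only moves forward, so every element is examined at most once.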
How do we process more complex queries? For example:
(Brutus OR Caesar) AND NOT Calpurnia
5. Boolean Retrieval Model
Processing Boolean queries
Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system.
Example: Brutus AND Caesar AND Calpurnia
6. Boolean Retrieval Model
Processing Boolean queries
Brutus AND Caesar AND Calpurnia
A major element is the order in which postings lists are accessed.
What is the best order for query processing?
(Calpurnia AND Brutus) AND Caesar
7. Boolean Retrieval Model
Processing Boolean queries
If we start by intersecting the two smallest postings lists, then all intermediate results must be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work.
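This smallest-first ordering can be sketched as follows. The postings lists and frequencies are hypothetical; the point is that Calpurnia, the rarest term, is processed first, so every intermediate result stays small:

```python
def intersect_many(postings_by_term, terms):
    """Intersect several postings lists, smallest first, so every
    intermediate result is no bigger than the smallest input list."""
    # Order terms by increasing postings-list length (document frequency).
    ordered = sorted(terms, key=lambda t: len(postings_by_term[t]))
    result = postings_by_term[ordered[0]]
    for term in ordered[1:]:
        other = set(postings_by_term[term])
        result = [d for d in result if d in other]
        if not result:          # early exit: nothing can match anymore
            break
    return result

postings = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}
print(intersect_many(postings, ["Brutus", "Caesar", "Calpurnia"]))  # [2]
```

In a real system the document frequency is stored in the dictionary, so the ordering can be decided without touching the postings lists themselves.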
8. The term vocabulary and postings lists
Choosing a Document Unit
Question: what is the document unit that should be used for indexing?
For email messages, is the unit:
• the text of the message?
• each attachment (.doc file / .rar file)?
For a collection of books, is the unit:
• each individual book (the entire book as a unit)?
• each chapter?
• each individual sentence?
The choice of unit is a trade-off between precision and recall: smaller units tend to favor precision, larger units tend to favor recall.
9. The term vocabulary and postings lists
Determining the vocabulary of terms
Recall the major steps in inverted index construction:
1. Collect the documents to be indexed.
2. Tokenize the text.
3. Do linguistic preprocessing of tokens.
4. Index the documents that each term occurs in.
Tokenization: the process of chopping character streams into tokens, throwing away certain characters (such as punctuation).
Linguistic preprocessing: building equivalence classes of tokens, which are the set of terms that are indexed.
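The four steps above can be sketched end to end on a toy collection. This is a deliberately crude tokenizer (keep runs of letters and digits, lowercase everything); the documents are illustrative:

```python
import re

def tokenize(text):
    """Step 2: chop the character stream into tokens, throwing away
    punctuation and other non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Steps 1-4: collect, tokenize, (trivially) preprocess, index."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(tokenize(text)):          # one posting per document
            index.setdefault(term, []).append(doc_id)
    for postings in index.values():
        postings.sort()                           # keep postings sorted by docID
    return index

docs = {1: "Brutus killed Caesar.", 2: "Caesar and Calpurnia."}
print(build_index(docs)["caesar"])  # [1, 2]
```

Here lowercasing plays the role of linguistic preprocessing; a real system would add normalization, stop-word removal, and possibly stemming at that step.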
10. The term vocabulary and postings lists
Determining the vocabulary of terms
Token, Type, or Term?
A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
A type is the class of all tokens containing the same character sequence.
A term is a (perhaps normalized) type that is included in the IR system’s dictionary.
11. The term vocabulary and postings lists
Determining the vocabulary of terms
What about apostrophes for possession and contractions?
doc_1: Dr. Thomas O’Daniel has been the President of Research since December 2006.
doc_2: Students’ solutions weren’t correct.
doc_3: Ahmad’s notebook isn’t cheap.
Example: Query = O’Daniel AND Research
Token 1: o’daniel
Token 2: odaniel
Token 3: o’ daniel
Token 4: o daniel
Question: what are the correct tokens to use?
12. The term vocabulary and postings lists
Determining the vocabulary of terms
What about tokens containing special characters?
doc_1: C# is a high-level, multi-paradigm, general-purpose programming language.
doc_2: C++ (pronounced cee plus plus) is a general purpose programming language.
doc_3: A+ is an array programming language descended from the programming language A.
Example: Query = C AND programming
Token 1: C#
Token 2: C #
Question: what are the correct tokens to use?
13. The term vocabulary and postings lists
Determining the vocabulary of terms
What about hyphenated tokens?
doc_1: C# is a high-level, multi-paradigm, general-purpose programming language.
doc_2: C++ (pronounced cee plus plus) is a general purpose programming language.
doc_3: A+ is an array programming language descended from the programming language A.
Example: Query = general-purpose AND programming
Token 1: general-purpose
Token 2: general purpose
Question: what are the correct tokens to use?
14. The term vocabulary and postings lists
Determining the vocabulary of terms
What about sequences that should be regarded as a single token?
doc_1: The West Bank, including East Jerusalem, has a land area of 5,640 km2.
doc_2: The West bank and Gaza Strip.
doc_3: There is a branch of the Arab Bank in Palestine in the west of Jenin City.
Example: Query = West Bank AND Palestine
Token 1: West Bank
Token 2: West
Token 3: Bank
Question: what are the correct tokens to use?
15. The term vocabulary and postings lists
Dropping Common Terms (Stop words Removal)
Stop words: some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely.
Using a stop list significantly reduces the number of postings that a system has to store.
Keyword searches with terms like "the" and "by" don’t seem very useful. However, this is not true for phrase searches: the meaning of "flights to London" is likely to be lost if the word "to" is stopped out.
Example: the phrase query "President of the United States" is more precise than "President" AND "United States", and "Flights to London" is more precise than "Flights" AND "London".
16. The term vocabulary and postings lists
Dropping Common Terms (Stop words Removal)
The general trend in IR systems over time has been:
from standard use of quite large stop lists (200–300 terms),
to very small stop lists (7–12 terms),
to no stop list whatsoever.
Question: do we really need to use stop lists?
Question: how can we exploit the statistics of language so as to cope with common words in better ways?
17. The term vocabulary and postings lists
Normalization (equivalence classing of terms)
Token normalization is the process of canonicalizing (standardizing) tokens so that matches occur despite superficial differences in the character sequences of the tokens.
The easy case is when the tokens of the query simply match tokens in the token list of the document. However, there are many cases when two character sequences are not quite the same but you would like a match to occur.
18. The term vocabulary and postings lists
Normalization (equivalence classing of terms)
Create equivalence classes, which are normally named after one member of the set. For example, each query token on the left falls into the same class as the document token on the right:
Query: anti-discriminatory, co-author, U.S.A
Document: antidiscriminatory, coauthor, USA
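A minimal sketch of this kind of equivalence classing, assuming a simple rule (strip hyphens and periods, then lowercase) that happens to cover the three pairs above; a real system would use a more carefully designed mapping:

```python
def normalize(token):
    """Map superficially different tokens to one canonical class name:
    stripping hyphens and periods makes 'anti-discriminatory' match
    'antidiscriminatory', and 'U.S.A' match 'USA'."""
    return token.replace("-", "").replace(".", "").lower()

pairs = [("anti-discriminatory", "antidiscriminatory"),
         ("co-author", "coauthor"),
         ("U.S.A", "USA")]
for query_tok, doc_tok in pairs:
    assert normalize(query_tok) == normalize(doc_tok)
print(normalize("U.S.A"))  # usa
```

Both query tokens and document tokens are passed through the same function, so matching reduces to equality of class names.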
19. The term vocabulary and postings lists
Normalization (equivalence classing of terms)
An alternative is to maintain relations between unnormalized tokens. This method can be extended to hand-constructed lists of synonyms such as car and automobile.
These term relationships can be achieved in two ways:
1. The usual way is to index unnormalized tokens and to maintain a query expansion list of multiple vocabulary entries to consider for a certain query term.
2. The alternative is to perform the expansion during index construction: when the document contains automobile, we index it under car as well (and, usually, also vice versa).
Use of either of these methods is considerably less efficient than equivalence classing, as there are more postings to store and merge.
20. The term vocabulary and postings lists
Accents and Diacritics
Diacritics: signs which, when written above or below a letter, indicate a difference in pronunciation from the same letter when unmarked or differently marked.
In English: naive and naïve.
This can be handled by normalizing tokens to remove diacritics.
What about other languages? In Arabic, for example, words such as كَتَبَ، كُتِبَ، كُتُب are distinguished only by their diacritics.
It might be best to equate all words to a form without diacritics.
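One standard way to equate words to a diacritic-free form is Unicode decomposition: decompose each character, then drop the combining marks. This works for Latin accents and for Arabic short-vowel marks alike, since both are encoded as combining characters:

```python
import unicodedata

def remove_diacritics(token):
    """Decompose to NFD, then drop combining marks, so that
    'naïve' and 'naive' fall into the same equivalence class."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(remove_diacritics("naïve"))  # naive
```

Whether stripping diacritics is appropriate depends on the language: for English queries it mostly helps, while for languages where diacritics are meaning-bearing it conflates distinct words, which is the trade-off discussed above.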
21. The term vocabulary and postings lists
Capitalization/Case-folding
Case-folding: refers to reducing all letters to lower case.
Naive → naive
General Motors → general motors
Drew University → drew university
Drew West → drew west
22. The term vocabulary and postings lists
Capitalization/Case-folding
Case-folding: refers to reducing all letters to lower case.
C.A.T → cat
23. The term vocabulary and postings lists
Capitalization/Case-folding
An alternative to making every token lowercase is to lowercase only some tokens.
The simplest heuristic is to convert to lowercase words at the beginning of a sentence and all words occurring in a title that is all uppercase or in which most or all words are capitalized. Mid-sentence capitalized words are left as capitalized (which is usually correct).
However, trying to get capitalization right in this way probably doesn’t help if your users usually use lowercase regardless of the correct case of words. Thus, lowercasing everything often remains the most practical solution.
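The sentence-initial part of this heuristic can be sketched in a few lines; handling all-uppercase titles is omitted for brevity, and the example sentence is illustrative:

```python
def heuristic_case_fold(sentence):
    """Lowercase only the sentence-initial word; leave mid-sentence
    capitalized words (likely proper nouns) as they are."""
    words = sentence.split()
    if not words:
        return words
    return [words[0].lower()] + words[1:]

print(heuristic_case_fold("Naive users searched for General Motors"))
# ['naive', 'users', 'searched', 'for', 'General', 'Motors']
```

Note how "Naive" is folded (it is only capitalized because it starts the sentence) while "General Motors" keeps its case, preserving the distinction between the company and the common words.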
24. The term vocabulary and postings lists
Other issues in English
Other possible normalizations are quite idiosyncratic and particular to English. For instance, you might wish to equate:
colour and color
3/12/91 and Mar. 12, 1991
Note the ambiguity of date formats: in the U.S., 3/12/91 is Mar. 12, 1991, whereas in Europe it is 3 Dec 1991.