This document provides an overview of information retrieval and the Boolean model. It defines information retrieval as finding relevant documents from large collections to satisfy an information need. The document introduces the Boolean model, including using a term-document matrix and inverted index to process Boolean queries. It discusses how Boolean queries are implemented by intersecting postings lists and optimizing for conjunctions and disjunctions. The challenges of Boolean search at scale are also covered.
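As a rough illustration of the term-document matrix idea mentioned above, the sketch below stores one bit vector per term and answers a Boolean query with bitwise operations. The toy documents and the helper names `incidence_vectors` and `matching_docs` are assumptions made for this example and do not come from the lecture.

```python
# Toy collection; documents and terms are illustrative assumptions.
docs = {
    0: "brutus killed caesar",
    1: "caesar praised brutus",
    2: "calpurnia warned caesar",
}

def incidence_vectors(docs):
    """Map each term to an int used as a bit vector: bit d is set if doc d contains the term."""
    vectors = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            vectors[term] = vectors.get(term, 0) | (1 << doc_id)
    return vectors

def matching_docs(bits, n_docs):
    """Decode a bit vector back into a sorted list of document IDs."""
    return [d for d in range(n_docs) if (bits >> d) & 1]

vectors = incidence_vectors(docs)
all_docs = (1 << len(docs)) - 1  # mask with one bit per document

# brutus AND caesar AND NOT calpurnia
result = vectors["brutus"] & vectors["caesar"] & (all_docs & ~vectors["calpurnia"])
print(matching_docs(result, len(docs)))
```

A full matrix needs one bit per term-document pair, which is why sparse structures such as the inverted index are preferred for large collections.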
This document provides an introduction to information retrieval. It discusses how Boolean retrieval works using an inverted index to process queries. It also covers how queries can be optimized by processing terms in order of increasing document frequency. Further topics covered include ranking search results, structured versus unstructured data, and more sophisticated techniques in information retrieval.
This document provides an introduction to Boolean retrieval and the inverted index data structure used in information retrieval systems. It discusses how an inverted index represents term-document incidence matrices in a space-efficient manner by only storing non-zero values. Boolean queries are processed by intersecting the postings lists of query terms in the inverted index. While simple, the Boolean model was widely used in commercial search engines for decades due to its precision.
This document provides an overview of a course on Information Storage and Retrieval. The course objectives are to familiarize students with theories of information storage and retrieval, introduce modern concepts of information retrieval systems, and acquaint students with indexing, matching, organizing and evaluating strategies for IR systems. The course content covers topics like text operations, term weighting, indexing structures, IR models, retrieval evaluation, and query languages over 14 weeks. Students will be evaluated through assessments, a project, and a final exam.
The document introduces information retrieval and describes how an inverted index works as the key data structure for modern IR systems. An inverted index stores for each term a list of all documents that contain the term. It allows efficient processing of Boolean queries by merging the postings lists of query terms. Query processing aims to optimize the order of processing terms based on their document frequencies to minimize the size of intermediate results.
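A minimal sketch of the structure described above, assuming whitespace tokenization and integer document IDs; `build_inverted_index` and `intersect` are hypothetical names chosen for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Return {term: sorted list of IDs of the documents containing the term}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # visiting IDs in order keeps every postings list sorted
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)

def intersect(p1, p2):
    """Merge two sorted postings lists, keeping only IDs present in both."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

docs = {1: "new home sales", 2: "home sales rise", 3: "new sales forecast"}
index = build_inverted_index(docs)
print(intersect(index["new"], index["home"]))
```

Because both lists are sorted, the merge advances one pointer per step and never revisits a posting.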
Information retrieval systems use indexes and inverted indexes to quickly search large document collections by mapping terms to their locations. Boolean retrieval uses an inverted index to process Boolean queries by intersecting postings lists to find documents that contain sets of terms. Key aspects of information retrieval systems include precision, recall, and ranking search results by relevance.
The document discusses Boolean retrieval models and indexing in information retrieval systems. It describes how a Boolean retrieval model views documents as sets of words and uses Boolean operators like AND, OR, and NOT to join query terms. An inverted index data structure is built on the text to speed up searches by storing, for each term, the documents that contain it. The document provides examples of how Boolean queries are processed by intersecting postings lists from the inverted index to retrieve matching documents in linear time. It also discusses some limitations of the Boolean model and advantages of ranking search results.
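The AND case is a plain merge of two sorted lists; OR and NOT can be handled in the same single pass. The sketch below assumes postings are sorted lists of integer document IDs, and the function names `union` and `and_not` are illustrative only.

```python
def union(p1, p2):
    """OR: merge two sorted postings lists without duplicates."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            out.append(p1[i])
            i += 1
        else:
            out.append(p2[j])
            j += 1
    out.extend(p1[i:])  # at most one of these two tails is non-empty
    out.extend(p2[j:])
    return out

def and_not(p1, p2):
    """AND NOT: keep IDs from p1 that do not occur in p2."""
    j = 0
    out = []
    for doc_id in p1:
        while j < len(p2) and p2[j] < doc_id:
            j += 1
        if j == len(p2) or p2[j] != doc_id:
            out.append(doc_id)
    return out

print(union([1, 3, 5], [2, 3, 6]))
print(and_not([1, 3, 5], [3, 4]))
```

Each function touches every posting at most once, so the cost is linear in the combined length of the two lists.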
This document discusses machine learning and information retrieval. It introduces machine learning and describes some common applications like bioinformatics, robotics, and computer vision. It then discusses information retrieval, including traditional keyword search approaches and a new example-based approach. Several prototype systems are described that use this example-based approach for tasks like movie search, academic literature search, image retrieval, and protein search. The approach is statistically principled, computationally fast, and easily parallelized.
1. DNA contains complex specified information that is organized in a digital code with syntax, semantics, and pragmatics. This information is used to build proteins and control processes in cells.
2. Modern science has not found a naturalistic explanation for how large amounts of specified digital information could arise. Intelligent design provides a better explanation than evolution by natural selection.
3. The intelligent designer behind DNA would need to be extremely intelligent to create such a complex system that exceeds human engineering abilities, and also have a sense of aesthetics to make nature delightful.
The Digital Library from Information Superhighway to the Semiotic Web - Martin Kalfatovic
The document discusses the evolution of digital libraries and the internet, from early experiments and protocols like FTP and Gopher to the modern internet and role of libraries online. It covers the transition to personal computers and client-server models, the development of the World Wide Web and digital collections. The document also references challenges of implementing new technologies in libraries and transitioning positions for a digital world.
Information retrieval is the process of finding documents that satisfy an information need from within large collections. The inverted index is the key data structure underlying modern IR systems, where each term is associated with a list of documents containing that term. Boolean queries can be processed efficiently using the inverted index by merging the postings lists of query terms. Phrase queries require positional indexes, where each term's postings list also includes term positions within documents.
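To make the positional index concrete, here is a small sketch, assuming whitespace tokenization and a nested dictionary layout (`term -> doc ID -> positions`). The names `build_positional_index` and `phrase_query` are hypothetical, and the position check uses simple list membership rather than the optimized merge a production system would use.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Return {term: {doc_id: [positions of the term in that document]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return sorted IDs of documents in which the phrase terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        starts = index[terms[0]][doc_id]
        # The k-th phrase term must appear exactly k positions after the start.
        if any(all((s + k) in index[t][doc_id] for k, t in enumerate(terms))
               for s in starts):
            hits.append(doc_id)
    return hits

docs = {1: "to be or not to be", 2: "be to or not", 3: "not to be"}
idx = build_positional_index(docs)
print(phrase_query(idx, "to be"))
```

Document 2 contains both words but in the wrong order, which a non-positional index could not distinguish.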
This document discusses building an inverted index to efficiently support information retrieval on large document collections. It describes tokenizing documents, building a dictionary of normalized terms, and creating postings lists that map each term to the documents it appears in. Inverted indexes allow skipping linear scanning and support flexible queries by indexing term locations. The document also covers calculating precision and recall to measure system effectiveness.
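Precision and recall can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below assumes both are given as collections of document IDs; returning 0.0 for an empty set is a convention chosen for this example.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, three relevant overall, two of them found.
p, r = precision_recall([1, 2, 3, 4], [2, 4, 5])
print(p, r)
```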
Digital Scholarship Seminar: Implications of Data for the 21st-century Humanist - Rebecca Davis
As increasing amounts of humanities data come online, scholars face new challenges in adapting traditional research, dissemination, and teaching practices. Without pretending to have all the answers, this presentation will address a constellation of related questions:
What do humanists gain from using new techniques for quick charting or mapping of their data?
How can we lower the technological barrier?
Does this compromise the deep analysis so valued in the humanities?
How is data in the humanities changing the relationship between researchers and archivists, as well as the nature of scholarly collaboration?
How does our evaluation of historical scholarship need to change? How much do algorithms and data literacy need to be a part of humanities courses?
What happens when we can’t understand where our data is coming from or what our digital tools are doing?
Fred Gibbs is an Assistant Professor of History at George Mason University and Director of Digital Scholarship at the Center for History and New Media.
This Digital Scholarship seminar will be facilitated by Kathryn Tomasek, Associate Professor of History at Wheaton College (MA) and will take place online in NITLE’s Virtual Auditorium. For more information, see our instructions on Participating in Online Events.
Artificial Intelligence is the study of how to make computers do things that currently require human intelligence. Researchers are creating systems that can mimic human thought, understand speech, beat the best human chess players, and more. Business Intelligence involves collecting and analyzing data to help businesses make informed decisions. It includes reporting, data mining, forecasting, and other analysis. Supply chain management involves coordinating the flow of materials and information between all parties involved in providing products to customers. Effective supply chain management can help lower costs and improve customer satisfaction.
Databases were created to solve problems with file-oriented data storage systems by providing a standardized way to efficiently store and retrieve large amounts of structured data. A database allows data to be stored and accessed easily regardless of amount, avoiding issues like data redundancy, poor data control, and inability to manipulate data that plagued previous systems. Modern databases power applications from banking systems to e-commerce sites like Amazon that need to dynamically query large inventories and product listings.
ESWC SS 2012 - Monday Keynote Enrico Franconi: Ontologies and Databases - eswcsummerschool
The document is a lecture on ontologies and databases given by Enrico Franconi. Some key points:
- An ontology is a formal conceptualization of a domain that specifies a set of constraints that define what must be true in any possible world.
- Ontology languages can range from simple to logic-based and are often expressed using diagrams. Conceptual data models like ER diagrams and UML class diagrams can also function as ontology languages.
- When querying a database via an ontology, the ontology acts as a mediator between the conceptual schema, logical schema, and data store. Deduction is used to infer additional constraints from what is specified in the ontology.
- Query answering over an ontology
The document provides an overview of the agenda for a day school session on computing topics. It will cover the evolution of computers, binary, HTML, and how to evaluate web resources. It introduces concepts like Moore's law, exponential growth, how computers store and process information using binary, and the basic anatomy of web addresses. Activities are included to have participants reflect on their first computer and how folding a piece of paper relates to exponential growth. Guidelines for evaluating online information using the PROMPT criteria of presentation, relevance, objectivity, method, provenance, and timeliness are also presented.
This document provides an overview of an introduction to artificial intelligence course, including:
- Course details such as the textbook, grading breakdown, and schedule
- Definitions and types of artificial intelligence including rational agents, the Turing test, and different branches of AI
- A brief history of ideas influencing AI such as philosophy, mathematics, psychology, and agents
- Examples of AI applications and challenges including ethics
Science, Technology and Society: The Information Age. Covers improvements in technology, how technology evolved, and how science has helped both technology and society. It also describes how science and technology have made life in society easier and their impact on today's world.
This talk covers the basics of the science of information retrieval, told as a story about information and its various aspects. It then takes you on a quick journey through the process of building a search engine.
The document discusses recommender systems and summarizes:
1) It introduces recommender systems and the different types including knowledge-based, collaborative filtering, and content-based recommendations.
2) It outlines some of the key resources for recommender systems including datasets, conferences, and articles.
3) It provides a high-level overview of common recommender system approaches like collaborative filtering, content-based analysis, and knowledge-based recommendations.
The document discusses the evolution of human intelligence through technological developments like the invention of writing 5000 years ago and printing over 1000 years ago. It describes how writing allowed information to be stored and transmitted independently of memory, enabling developments like organized religion and objective knowledge. Later sections discuss models for the exponential growth of computing power and how computers may be able to match and then exceed human brain capabilities through techniques like neural networks, detailed neuron modeling, and ability to share knowledge rapidly. The ability of computers to read and understand large amounts of information on their own is also described.
Blank Paper To Type On Blank Paper By Montroytan - Lisa Garcia
The document discusses the field of cosmology, which is the study of the universe from the largest galaxies down to the smallest atom, and how modern technology has allowed scientists to gain new insights into the structure and history of the universe with the goal of developing a unified theory. It provides context on the Copernican Revolution, Newton's Laws of Motion, and Einstein's theories of relativity as important milestones in the development of modern cosmological theories, and notes that today's advanced telescopes have enabled scientists to better understand the vast scale of the universe and how galaxies, stars, and planets form.
This document provides an overview of Boolean retrieval models for information retrieval. It discusses how Boolean queries using AND, OR, and NOT operators allow documents to be retrieved based on an exact match to the query terms. An inverted index representation maps terms to postings lists of document IDs. Boolean queries can be evaluated by merging the relevant postings lists. Query optimization techniques, such as processing terms in order of increasing document frequency, can improve efficiency. The document provides examples of Boolean queries from commercial search systems.
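The ordering heuristic mentioned above can be sketched as follows: sort the postings lists by length (the document frequency of each term) and intersect from the rarest term outward, so the intermediate result never exceeds the shortest list. The postings values are made up for illustration, `conjunctive_query` is a hypothetical name, and the set-based `intersect` is a shortcut standing in for a sorted-list merge.

```python
def intersect(p1, p2):
    """Keep the IDs of p1 that also occur in p2 (set-based shortcut, order of p1 preserved)."""
    members = set(p2)
    return [d for d in p1 if d in members]

def conjunctive_query(index, terms):
    """AND all terms together, processing postings in order of increasing document frequency."""
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        if not result:  # nothing can match any more, so stop early
            break
        result = intersect(result, postings)
    return result

# Illustrative postings lists (assumed values).
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(conjunctive_query(index, ["brutus", "caesar", "calpurnia"]))
```

Starting with `calpurnia` means the first intermediate result has at most four entries, regardless of how long the other lists are.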
This document provides an overview and syllabus for an Artificial Intelligence course. It introduces the instructor, Dr. Zulfiqar Ali, and provides contact information. It lists the primary textbook as Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig and notes other reference materials. The course aims to provide understanding of fundamental AI techniques like agents, search, knowledge representation, and planning under uncertainty. It outlines the topics to be covered in each section of the course.
Resumes, Cover Letters, and Applying Online - Bruce Bennett
This webinar showcases resume styles and the elements that go into building your resume. Every job application requires unique skills, and this session will show you how to improve your resume to match the jobs to which you are applying. Additionally, we will discuss cover letters and ideas to include in them, so that you have the best chance of success when applying for a new position. Learn how to take advantage of all the features when uploading a job application to a company’s applicant tracking system.
1. Lecture 1: Introduction and the Boolean Model
Information Retrieval
Computer Science Tripos Part II
Simone Teufel
Natural Language and Information Processing (NLIP) Group
Simone.Teufel@cl.cam.ac.uk
2. Overview
1 Motivation
Definition of “Information Retrieval”
IR: beginnings to now
2 First Boolean Example
Term-Document Incidence matrix
The inverted index
Processing Boolean Queries
Practicalities of Boolean Search
3. What is Information Retrieval?
Manning et al, 2008:
Information retrieval (IR) is finding material . . . of an unstructured
nature . . . that satisfies an information need from within large
collections . . . .
6. Document Collections
IR in the 17th century: Samuel Pepys, the famous English diarist,
subject-indexed his treasured library of more than 1000 books with keywords.
8. What we mean here by document collections
Manning et al, 2008:
Information retrieval (IR) is finding material (usually documents)
of an unstructured nature . . . that satisfies an information need
from within large collections (usually stored on computers).
Document Collection: text units we have built an IR system
over.
Usually documents
But could be
memos
book chapters
paragraphs
scenes of a movie
turns in a conversation...
Lots of them
11. What is Information Retrieval?
Manning et al, 2008:
Information retrieval (IR) is finding material (usually documents)
of an unstructured nature . . . that satisfies an information need
from within large collections (usually stored on computers).
12. Structured vs Unstructured Data
Unstructured data means that a formal, semantically overt,
easy-for-computer structure is missing.
In contrast to the rigidly structured data used in DB style
searching (e.g. product inventories, personnel records)
SELECT *
FROM business_catalogue
WHERE category = 'florist'
AND city_zip = 'cb1'
This does not mean that there is no structure in the data
Document structure (headings, paragraphs, lists. . . )
Explicit markup formatting (e.g. in HTML, XML. . . )
Linguistic structure (latent, hidden)
13. Information Needs and Relevance
Manning et al, 2008:
Information retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information
need from within large collections (usually stored on computers).
An information need is the topic about which the user desires
to know more.
A query is what the user conveys to the computer in an
attempt to communicate the information need.
A document is relevant if the user perceives that it contains
information of value with respect to their personal information
need.
14. Types of information needs
Manning et al, 2008:
Information retrieval (IR) is finding material . . . of an unstructured
nature . . . that satisfies an information need from within large
collections . . . .
Known-item search
Precise information seeking search
Open-ended search (“topical search”)
15. Information scarcity vs. information abundance
Information scarcity problem (or needle-in-haystack problem):
hard to find rare information
Lord Byron’s first words? 3 years old? Long sentence to the
nurse in perfect English?
. . . when a servant had spilled an urn of hot coffee over his legs, he replied to
the distressed inquiries of the lady of the house, ’Thank you, madam, the
agony is somewhat abated.’ [not Lord Byron, but Lord Macaulay]
Information abundance problem (for more clear-cut
information needs): redundancy of obvious information
What is toxoplasmosis?
16. Relevance
Manning et al, 2008:
Information retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information
need from within large collections (usually stored on computers).
Are the retrieved documents
about the target subject?
up-to-date?
from a trusted source?
satisfying the user’s needs?
How should we rank documents in terms of these factors?
More on this in a lecture soon
17. How well has the system performed?
The effectiveness of an IR system (i.e., the quality of its search
results) is determined by two key statistics about the system’s
returned results for a query:
Precision: What fraction of the returned results are relevant to
the information need?
Recall: What fraction of the relevant documents in the
collection were returned by the system?
What is the best balance between the two?
Easy to get perfect recall: just retrieve everything
Easy to get good precision: retrieve only the most relevant
There is much more to say about this – lecture 5
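The two statistics can be computed directly from sets of document IDs. The sketch below is illustrative only: the function name `precision_recall` and the example document IDs are made up for this note, and empty sets are mapped to 0.0 by assumption.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query, given collections of document IDs."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)                      # relevant documents that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents returned, 3 of them relevant,
# and 6 relevant documents exist in the collection.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9})
print(p, r)  # 0.75 0.5
```

Returning every document in the collection drives recall to 1.0 while precision collapses, which is the trade-off the slide points at.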
18. IR today
Web search
Search ground is billions of documents on millions of
computers
Issues: spidering; efficient indexing and search; malicious
manipulation to boost search engine rankings
Link analysis covered in Lecture 8
Enterprise and institutional search
e.g., a company’s documentation, patents, research articles
Often domain-specific
Centralised storage; dedicated machines for search
Most prevalent IR evaluation scenario: US intelligence analyst’s
searches
Personal information retrieval (email, personal documents)
e.g., Mac OS X Spotlight; Windows’ Instant Search
Issues: different file types; maintenance-free, lightweight to run
in background
19. A short history of IR
[Timeline figure, 1945 to the 2000s. Milestones in order:
memex (1945)
Term “IR” coined by Calvin Mooers
Literature searching systems; evaluation by precision and recall (Alan Kent)
Cranfield experiments
Boolean IR
SMART
Salton; vector space model (VSM)
PageRank
TREC
Multimedia; Multilingual (CLEF)
Recommendation Systems
Inset: precision/recall plot]
22. Areas of IR
“Ad hoc” retrieval (lectures 1-5)
web retrieval (lecture 8)
Support for browsing and filtering document collections:
Clustering (lecture 6)
Classification; using fixed labels (common information needs,
age groups, topics; lecture 7)
Further processing a set of retrieved documents, e.g., by using
natural language processing
Information extraction
Summarisation
Question answering
24. Boolean Retrieval
In the Boolean retrieval model we can pose any query in the
form of a Boolean expression of terms
i.e., one in which terms are combined with the operators and,
or, and not.
Shakespeare example
25. Brutus AND Caesar AND NOT Calpurnia
Which plays of Shakespeare contain the words Brutus and
Caesar, but not Calpurnia?
Naive solution: linear scan through all text – “grepping”
In this case, this works OK (Shakespeare’s Collected Works have fewer
than 1M words).
But in the general case, with much larger text collections, we
need to index.
Indexing is an offline operation that collects data about which
words occur in a text, so that at search time you only have to
access the precompiled index.
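For contrast with indexing, the naive linear scan can be sketched as follows. The three "plays" are toy stand-in strings, not the real texts, and the helper names `tokens` and `linear_scan` are invented for this illustration.

```python
import re

# Toy stand-ins for the plays; the real texts are not reproduced here.
plays = {
    "Antony and Cleopatra": "When Antony found Julius Caesar dead he found Brutus slain",
    "Julius Caesar": "Caesar spoke to Brutus and to Calpurnia",
    "The Tempest": "a storm and an island",
}

def tokens(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def linear_scan(docs, required, forbidden):
    """Read every document at query time; cost grows with the total text size."""
    hits = []
    for title, text in docs.items():
        words = tokens(text)
        if all(w in words for w in required) and not any(w in words for w in forbidden):
            hits.append(title)
    return hits

result = linear_scan(plays, required=["brutus", "caesar"], forbidden=["calpurnia"])
print(result)  # ['Antony and Cleopatra']
```

Every query re-reads the whole collection here, which is exactly the work an offline index avoids.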
26. The term-document incidence matrix
Main idea: record for each document whether it contains each
word out of all the different words Shakespeare used (about 32K).
            Antony and   Julius   The
            Cleopatra    Caesar   Tempest   Hamlet   Othello   Macbeth
Antony      1            1        0         0        0         1
Brutus      1            1        0         1        0         0
Caesar      1            1        0         1        1         1
Calpurnia   0            1        0         0        0         0
Cleopatra   1            0        0         0        0         0
mercy       1            0        1         1        1         1
worser      1            0        1         1        1         0
. . .
Matrix element (t, d) is 1 if the play in column d contains the
word in row t, 0 otherwise.
29. Query “Brutus AND Caesar AND NOT Calpurnia”
We compute the results for our query as the bitwise AND between
vectors for Brutus, Caesar and complement (Calpurnia):
            Antony and   Julius   The
            Cleopatra    Caesar   Tempest   Hamlet   Othello   Macbeth
Antony      1            1        0         0        0         1
Brutus      1            1        0         1        0         0
Caesar      1            1        0         1        1         1
¬Calpurnia  1            0        1         1        1         1
Cleopatra   1            0        0         0        0         0
mercy       1            0        1         1        1         1
worser      1            0        1         1        1         0
AND         1            0        0         1        0         0
Bitwise AND returns two documents, “Antony and Cleopatra” and
“Hamlet”.
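The same computation can be reproduced with Python integers used as bit vectors. This is a minimal sketch: the variable names are chosen for this note, and the leftmost bit is assumed to stand for the first play in the list.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Rows of the incidence matrix as bit vectors, one bit per play (leftmost = first play).
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

mask = (1 << len(plays)) - 1                      # restricts the complement to 6 bits
not_calpurnia = ~incidence["Calpurnia"] & mask    # 101111
result = incidence["Brutus"] & incidence["Caesar"] & not_calpurnia

# Decode the result vector back into play titles.
bits = format(result, f"0{len(plays)}b")
matches = [title for title, bit in zip(plays, bits) if bit == "1"]
print(bits, matches)  # 100100 ['Antony and Cleopatra', 'Hamlet']
```

The mask matters because Python integers are unbounded, so a bare `~` would produce a negative number rather than a 6-bit complement.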
30. The results: two documents
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring, and he wept
When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i’ the
Capitol; Brutus killed me.
31. Bigger collections
Consider N = 10^6 documents, each with about 1000 tokens
10^9 tokens at avg 6 Bytes per token ⇒ 6GB
Assume there are M = 500,000 distinct terms in the collection
Size of incidence matrix is then 500,000 × 10^6
Half a trillion 0s and 1s
32. Can’t build the Term-Document incidence matrix
Observation: the term-document matrix is very sparse
Contains no more than one billion 1s.
Better representation: only represent the things that do occur
Term-document matrix has other disadvantages, such as lack
of support for more complex query operators (e.g., proximity
search)
We will move towards richer representations, beginning with
the inverted index.
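The sparsity argument is plain arithmetic, checked here with the collection figures given for the bigger collection (a million documents of about 1000 tokens each, 500,000 distinct terms). The variable names are chosen for this sketch.

```python
N = 10**6                 # documents
tokens_per_doc = 1000
bytes_per_token = 6
M = 500_000               # distinct terms

total_tokens = N * tokens_per_doc                    # 10**9 tokens
collection_bytes = total_tokens * bytes_per_token    # 6 * 10**9 bytes, i.e. 6 GB
matrix_cells = M * N                                 # 5 * 10**11 cells
max_ones = total_tokens                              # each token sets at most one cell to 1
zero_fraction = 1 - max_ones / matrix_cells          # lower bound on the share of 0 cells

print(total_tokens, collection_bytes, matrix_cells)
print(f"{zero_fraction:.1%}")  # 99.8%
```

So at least 99.8% of the matrix is zeros, which is why storing only the 1s pays off.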
33. The inverted index
The inverted index consists of
a dictionary of terms (also: lexicon, vocabulary)
and a postings list for each term, i.e., a list that records which
documents the term occurs in.
Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132 179
Calpurnia → 2 31 54 101
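A dictionary mapping each term to a sorted list of document IDs is one simple way to hold such an index in memory. The sketch below assumes whitespace tokenisation and a made-up three-document collection; `build_inverted_index` is a name invented for this note.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)       # a set ignores repeated occurrences
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical three-document collection.
docs = {
    1: "brutus killed caesar",
    2: "caesar and calpurnia",
    4: "brutus and caesar",
}
index = build_inverted_index(docs)
print(index["brutus"])  # [1, 4]
print(index["caesar"])  # [1, 2, 4]
```

The keys of `index` play the role of the dictionary, and each value is one postings list, kept sorted by document ID.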
34. Processing Boolean Queries: conjunctive queries
Our Boolean Query
Brutus AND Calpurnia
Locate the postings lists of both query terms and intersect them.
Brutus       → 1 2 4 11 31 45 173 174
Calpurnia    → 2 31 54 101
Intersection → 2 31
Note: this only works if postings lists are sorted
35. Algorithm for intersection of two postings
INTERSECT(p1, p2)
  answer ← ⟨ ⟩
  while p1 ≠ NIL and p2 ≠ NIL
  do if docID(p1) = docID(p2)
       then ADD(answer, docID(p1))
            p1 ← next(p1)
            p2 ← next(p2)
       else if docID(p1) < docID(p2)
              then p1 ← next(p1)
              else p2 ← next(p2)
  return answer
Brutus       → 1 2 4 11 31 45 173 174
Calpurnia    → 2 31 54 101
Intersection → 2 31
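A direct Python rendering of the merge, using list indices in place of linked-list pointers, might look like this. It is a sketch that assumes both postings lists are sorted in ascending docID order.

```python
def intersect(p1, p2):
    """Merge-intersect two postings lists sorted by ascending docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Each loop iteration advances at least one index, so the work is at most the sum of the two list lengths.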
36. Complexity of the Intersection Algorithm
Bounded by worst-case length of postings lists
Thus “officially” O(N), with N the number of documents in
the document collection
But in practice much, much better than linear scanning,
which is asymptotically also O(N)
37. Query Optimisation: conjunctive terms
Organise order in which the postings lists are accessed so that least
work needs to be done
Brutus AND Caesar AND Calpurnia
Process terms in increasing document frequency: execute as
(Calpurnia AND Brutus) AND Caesar
Brutus    (document frequency 8) → 1 2 4 11 31 45 173 174
Caesar    (document frequency 9) → 1 2 4 5 6 16 57 132 179
Calpurnia (document frequency 4) → 2 31 54 101
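The ordering heuristic can be sketched by sorting the query terms by postings-list length before intersecting. The function names are invented for this note, and document frequency is taken to be the length of the postings list.

```python
def intersect(p1, p2):
    """Merge-intersect two postings lists sorted by ascending docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def conjunctive_query(index, terms):
    """AND the terms' postings lists together, rarest term first."""
    ordered = sorted(terms, key=lambda t: len(index[t]))   # increasing document frequency
    result = index[ordered[0]]
    for term in ordered[1:]:
        if not result:          # an empty intermediate result can never grow again
            break
        result = intersect(result, index[term])
    return ordered, result

index = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132, 179],
    "Calpurnia": [2, 31, 54, 101],
}
order, hits = conjunctive_query(index, ["Brutus", "Caesar", "Calpurnia"])
print(order)  # ['Calpurnia', 'Brutus', 'Caesar']
print(hits)   # [2]
```

Starting with the shortest list keeps every intermediate result no longer than that list, so later intersections touch fewer entries.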
38. Query Optimisation: disjunctive terms
(maddening OR crowd) AND (ignoble OR strife) AND (killed OR slain)
Process the query in increasing order of the size of each
disjunctive term
Estimate this in turn (conservatively) by the sum of
frequencies of its disjuncts
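The size estimate for each OR-group can be sketched as follows. The document frequencies are invented purely for illustration, and `order_disjunctive_groups` is a hypothetical helper name.

```python
def order_disjunctive_groups(groups, doc_freq):
    """Order OR-groups by estimated size: the sum of their terms' document frequencies."""
    def estimate(group):
        # Upper bound on the size of the union of the group's postings lists.
        return sum(doc_freq[term] for term in group)
    return sorted(groups, key=estimate)

# Hypothetical document frequencies, invented for illustration.
doc_freq = {"maddening": 12, "crowd": 300, "ignoble": 5, "strife": 40,
            "killed": 900, "slain": 150}
groups = [("maddening", "crowd"), ("ignoble", "strife"), ("killed", "slain")]
print(order_disjunctive_groups(groups, doc_freq))
# [('ignoble', 'strife'), ('maddening', 'crowd'), ('killed', 'slain')]
```

The sum is conservative because documents containing both disjuncts are counted twice, so the true union is never larger than the estimate.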
39. Practical Boolean Search
Provided by large commercial information providers
1960s-1990s
Complex query language; complex and long queries
Extended Boolean retrieval models with additional operators –
proximity operators
Proximity operator: two terms must occur close together in a
document (in terms of certain number of words, or within
sentence or paragraph)
Unordered results...
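A word-distance proximity test can be sketched if the word offsets of each term within a document are available. That positional information is an assumption of this example (a plain docID postings list does not carry it), and `within_k_words` is a hypothetical helper.

```python
def within_k_words(positions_a, positions_b, k):
    """True if some occurrence of term A lies within k words of some occurrence of term B.

    Both arguments are sorted lists of word offsets within one document.
    """
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        if positions_a[i] < positions_b[j]:
            i += 1                 # move the earlier occurrence forward
        else:
            j += 1
    return False

text = "requirements for a place of employment".split()   # hypothetical document
pos = {w: [n for n, tok in enumerate(text) if tok == w] for w in set(text)}
print(within_k_words(pos["employment"], pos["place"], 3))         # True
print(within_k_words(pos["requirements"], pos["employment"], 3))  # False
```

Sentence-level or paragraph-level proximity would need sentence or paragraph boundaries recorded as well, which this sketch does not attempt.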
40. Example: Westlaw
Largest commercial legal search service – 500K subscribers
Boolean Search and ranked retrieval both offered
Document ranking only wrt chronological order
Expert queries are carefully defined and incrementally
developed
41. Westlaw Queries/Information Needs
“trade secret” /s disclos! /s prevent /s employe!
Information need: Information on the legal theories involved in
preventing the disclosure of trade secrets by employees formerly
employed by a competing company.
disab! /p access! /s work-site work-place (employment /3 place)
Information need: Requirements for disabled people to be able to
access a workplace.
host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest
Information need: Cases about a host’s responsibility for drunk
guests.
42. Comments on WestLaw
Proximity operators: /3 = within 3 words, /s = within the same
sentence, /p = within the same paragraph
Space is disjunction, not conjunction (This was standard in
search pre-Google.)
Long, precise queries: incrementally developed, unlike web
search
Why professional searchers like Boolean queries: precision,
transparency, control.
43. Does Google use the Boolean Model?
On Google, the default interpretation of a query [w1 w2 ... wn] is
w1 AND w2 AND ... AND wn
Cases where you get hits which don’t contain one of the wi:
Page contains variant of wi (morphology, misspelling,
synonym)
long query (n is large)
Boolean expression generates very few hits
wi was in the anchor text
Google also ranks the result set
Simple Boolean Retrieval returns matching documents in no
particular order.
Google (and most well-designed Boolean engines) rank hits
according to some estimator of relevance