INTRODUCTION TO INFORMATION RETRIEVAL
This lecture will introduce the information retrieval problem, introduce the terminology related to IR, and provide a history of IR. In particular, the history of the web and its impact on IR will be discussed. Special attention and emphasis will be given to the concept of relevance in IR and the critical role it has played in the development of the subject. The lecture will end with a conceptual explanation of the IR process, and its relationships with other domains as well as current research developments.
INFORMATION RETRIEVAL MODELS
This lecture will present the models that have been used to rank documents according to their estimated relevance to user-given queries, where the most relevant documents are shown ahead of those less relevant. Many of these models form the basis for the ranking algorithms used in past and present search applications. The lecture will describe models of IR such as Boolean retrieval, vector space, probabilistic retrieval, language models, and logical models. Relevance feedback, a technique that either implicitly or explicitly modifies user queries in light of their interaction with retrieval results, will also be discussed, as this is particularly relevant to web search and personalization.
The (standard) Boolean model of information retrieval (BIR) is a classical information retrieval (IR) model and, at the same time, the first and most-adopted one. ... The BIR is based on Boolean logic and classical set theory in that both the documents to be searched and the user's query are conceived as sets of terms.
CS8080 INFORMATION RETRIEVAL TECHNIQUES - IRT - UNIT - I PPT IN PDF
1. P1WU
UNIT – I: INTRODUCTION
TOPIC -1 : INFORMATION RETRIEVAL
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
2. UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The User's Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of the IR System
8. The Retrieval and Ranking Processes
9. The Web
10. The e-Publishing Era
11. How the web changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search Interfaces
3. INTRODUCTION TO INFORMATION RETRIEVAL
4. What is IR?
• Information retrieval (IR) is finding material of an unstructured nature that satisfies an information need from within large collections (usually stored on computers).
• Unstructured data means that a formal, semantically overt, easy-for-computer structure is missing.
• This is in contrast to the rigidly structured data used in DB-style searching (e.g. product inventories, personnel records).
5. What is IR?
• The process of actively seeking out information relevant to a topic of interest (van Rijsbergen)
• Typically it refers to the automatic (rather than manual) retrieval of documents
• Information Retrieval System (IRS)
• “Document” is the generic term for an information holder (book, chapter, article, webpage, etc.)
6. What is IR?
• Information retrieval is the science of searching for
a) information in a document,
b) documents themselves,
c) metadata that describes data, and
d) databases of texts, images or sounds.
7. What is IR?
• An Information Retrieval (IR) system can be defined as a software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information.
• Information Retrieval is the activity of obtaining material, usually documents of an unstructured nature (i.e. usually text), that satisfies an information need from within large collections stored on computers.
8. What is IR?
• IR helps users find information that matches their information needs, expressed as queries.
• Historically, IR is about document retrieval, emphasizing the document as the basic unit: finding documents relevant to user queries.
• Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.
• For example, information retrieval takes place when a user enters a query into the system.
9. What is IR?
• Information retrieval (IR) is a broad area of Computer Science focused primarily on providing users with easy access to information of their interest.
• Information retrieval deals with
• the representation,
• storage,
• organization of, and
• access to information items such as documents, Web pages, online catalogs, structured and semi-structured records, multimedia objects.
10. What is IR?
• Information retrieval (IR)
in computing and information science is the process of
obtaining information system resources that are
relevant to an information need from a collection of
those resources. Searches can be based on full-text or
other content-based indexing.
11. What is IR?
• Information retrieval (IR) quality:
• Are the retrieved documents
• about the target subject?
• up-to-date?
• from a trusted source?
• satisfying the user’s needs?
• How should we rank documents in terms of these factors?
12. What are the measures of IR effectiveness?
• Retrieval effectiveness in information retrieval (IR) is described by two parameters:
1. Precision and
2. Recall
These are measures of retrieval quality, not types of IR.
1. Precision is the fraction of the retrieved documents that are relevant to the user.
2. Recall is the fraction of the relevant documents in the collection that are retrieved.
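The two measures on this slide can be computed directly from a set of retrieved documents and a set of relevant documents. The following is a minimal sketch, not part of the original slides; the function name `precision_recall` and the document identifiers are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant in the collection
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    # Guard against division by zero when either set is empty.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 5 relevant in the collection.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d3", "d5", "d6", "d7"]
precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(precision, recall)  # 0.5 0.4
```

Two of the four retrieved documents are relevant (precision 2/4), and two of the five relevant documents were found (recall 2/5).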
13. What are the types of information retrieval?
• Areas in which information retrieval techniques are employed include:
• Adversarial information retrieval.
• Automatic summarization. Multi-document summarization.
• Compound term processing.
• Cross-lingual retrieval.
• Document classification.
• Spam filtering.
• Question answering.
14. Why do we do IR?
• Information retrieval can provide organizations with immediate value.
• While it is important to try to figure out ways to capture tacit knowledge,
• information retrieval provides a means to get at information that already exists in electronic formats.
15. Why do we do IR?
• Information retrieval can provide organizations with immediate value: while it is important to try to figure out ways to capture tacit knowledge, information retrieval provides a means to get at information that already exists in electronic formats.
• The representation and organization of the information items should be such as to provide the users with easy access to information of their interest.
• Nowadays, research in IR includes modeling, Web search, text classification, systems architecture, user interfaces, data visualization, filtering, and languages.
16. What is the need for information retrieval?
• Information retrieval can provide organizations with immediate value.
• While it is important to try to figure out ways to capture tacit knowledge, information retrieval provides a means to get at information that already exists in electronic formats.
17. What is the need for information retrieval on the Web?
• Web: a huge, widely-distributed, highly heterogeneous, semistructured, interconnected, evolving hypertext/hypermedia information repository
• Main issues:
• Abundance of information: 99% of all the information is not interesting for 99% of all users
• The static Web is a very small part of the whole Web; much of it consists of dynamic websites
• To access the Web, users need to exploit Search Engines (SE)
• SE must be improved
• to help people better formulate their information needs
• More personalization is needed
18. WEB IR : What is Web IR?
• Web IR can be defined as
• the application of theories and methodologies from IR to the World Wide Web.
Web Information Retrieval models are ways of integrating many sources of
evidence about documents,
1. the links,
2. the structure of the document,
3. the actual content of the document,
4. the quality of the document, etc.
so that an effective Web search engine can be achieved.
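One simple way to picture "integrating many sources of evidence" is a weighted sum of per-document signals. The sketch below is only an illustration under stated assumptions: the `Evidence` fields, the weights in `WEIGHTS`, and the example pages are all hypothetical, and the slides do not say how any real Web search engine combines its signals.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical per-document signals, each assumed to lie in [0, 1]."""
    content: float    # how well the document text matches the query
    structure: float  # e.g. query terms appearing in the title or headings
    links: float      # link-based evidence about the document
    quality: float    # an estimate of document quality


# Assumed weights, chosen only for illustration; they sum to 1.
WEIGHTS = {"content": 0.5, "structure": 0.2, "links": 0.2, "quality": 0.1}


def combined_score(e, weights=WEIGHTS):
    """Integrate the evidence sources into one score with a weighted sum."""
    return (weights["content"] * e.content
            + weights["structure"] * e.structure
            + weights["links"] * e.links
            + weights["quality"] * e.quality)


# Hypothetical pages with made-up evidence values.
pages = {
    "page_a": Evidence(content=0.9, structure=0.5, links=0.1, quality=0.5),
    "page_b": Evidence(content=0.6, structure=0.8, links=0.9, quality=0.9),
}
ranking = sorted(pages, key=lambda p: combined_score(pages[p]), reverse=True)
print(ranking)  # ['page_b', 'page_a']
```

Here `page_b` outranks `page_a` even though its content match is weaker, because the link, structure and quality evidence all favour it.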
19. Issues in IR
• The main issues in Information Retrieval (IR) are:
1. Document and Query Indexing
2. Query Evaluation
3. System Evaluation
20. Issues in IR : Document and Query Indexing
• Document and Query Indexing: the main goal of document and query indexing is to find the important meanings and create an internal representation.
• The factors to be considered are accuracy in representing semantics, exhaustiveness, and ease of manipulation by a computer.
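A common internal representation that is easy for a computer to manipulate is an inverted index, which maps each term to the documents containing it. The slides do not name a specific structure, so the following is a minimal sketch under that assumption; `tokenize`, `build_inverted_index`, and the sample documents are hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


# Hypothetical three-document collection.
docs = {
    1: "information retrieval finds documents",
    2: "data retrieval returns exact records",
    3: "web search engines rank documents",
}
index = build_inverted_index(docs)
print(sorted(index["retrieval"]))  # [1, 2]
print(sorted(index["documents"]))  # [1, 3]
```

A query can be indexed with the same `tokenize` function, so documents and queries end up in the same term-based representation.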
21. Issues in IR : Query Evaluation
• Query Evaluation: in the retrieval model, how can a document be represented with the selected keywords, and how are document and query representations compared to calculate a score?
• Information Retrieval (IR) deals with issues like uncertainty and vagueness in information systems.
• Uncertainty: the available representation does not typically reflect the true semantics of objects such as images, videos, etc.
• Vagueness: the information that the user requires lacks clarity and is only vaguely expressed in a query, feedback or user action.
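To make the comparison of document and query representations concrete, here is a minimal sketch that assumes a vector space model with TF-IDF weights and cosine similarity. The slide does not prescribe this model; it is one possible choice, and the function names, the idf formula `log(N/df) + 1`, and the sample documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build a sparse TF-IDF vector (term -> weight) for every document."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = {
        d: {term: tf * idf[term] for term, tf in Counter(toks).items()}
        for d, toks in tokenized.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Score every document against the query, highest score first."""
    vectors, idf = tf_idf_vectors(docs)
    counts = Counter(query.lower().split())
    # Query terms that never occur in the collection are ignored.
    q_vec = {term: tf * idf[term] for term, tf in counts.items() if term in idf}
    scores = {d: cosine(q_vec, vec) for d, vec in vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical collection and query.
docs = {
    "a": "information retrieval ranks documents",
    "b": "cooking recipes for pasta",
    "c": "retrieval of web documents and web pages",
}
ranked = rank("web retrieval", docs)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

Document `c` shares both query terms and ranks first, `a` shares one term, and `b` shares none and receives a score of zero.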
22. Issues in IR : System Evaluation
• System Evaluation: system evaluation concerns determining the impact of the information given on user achievement.
• Here, we also examine the efficiency of the particular system in terms of time and space.
23. IR in practice
• Not only librarians, professional searchers, etc. engage in the activity of information retrieval;
• nowadays hundreds of millions of people engage in IR every day when they use web search engines.
• Information retrieval is believed to be the dominant form of information access.
24. IR in practice
• Information Retrieval is a research-driven theoretical and
experimental discipline
• The focus is on different aspects of the information-seeking process, depending on the researcher’s background or interest:
• Computer scientist – fast and accurate search engine
• Librarian – organization and indexing of information
• Cognitive scientist – the process in the searcher’s mind
• Philosopher – Is this really relevant ?
25. IR in practice
• Progress influenced by advances in Computational
Linguistics, Information Visualization, Cognitive
Psychology, HCI, …
• Experimental vs. operational systems
• Analogy to car manufacturing
26. IR Process
• An information retrieval process begins when a user enters a
query into the system.
• Queries are formal statements of information needs, for
example search strings in web search engines. In information
retrieval a query does not uniquely identify a single object in
the collection.
• Instead, several objects may match the query, perhaps with
different degrees of relevancy.
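The difference between a lookup that identifies exactly one object and an IR query that matches several objects to different degrees can be sketched as follows. Everything here is illustrative: the records, the documents, and the use of Jaccard term overlap as the relevance score are assumptions, not something stated on the slides.

```python
# Data retrieval: a key uniquely identifies one record (hypothetical data).
records = {"E1001": "Alice, Accounts", "E1002": "Bob, Engineering"}
print(records.get("E1002"))  # Bob, Engineering


def overlap_score(query, text):
    """Jaccard overlap between query terms and document terms, in [0, 1]."""
    q_terms = set(query.lower().split())
    d_terms = set(text.lower().split())
    if not q_terms or not d_terms:
        return 0.0
    return len(q_terms & d_terms) / len(q_terms | d_terms)


def search(query, docs):
    """Return every matching document with its score, best match first."""
    scored = [(doc_id, overlap_score(query, text)) for doc_id, text in docs.items()]
    matches = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Information retrieval: several documents match, each to a different degree.
docs = {
    "d1": "library opening hours",
    "d2": "history of the library of alexandria",
    "d3": "train timetable",
}
results = search("alexandria library history", docs)
print(results)  # [('d2', 0.6), ('d1', 0.2)]
```

The same query returns two documents with different scores, while the keyed lookup returns a single record or nothing at all.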
27. Fundamental concepts in IR
28. Fundamental concepts in IR
• What is information ?
• Meaning vs. form
• Data vs. Information Retrieval
• Relevance
29. The stages of IR
30. FORMULATION OF IR PROCESS
31. IR system
• The IR system assists users in finding the information they require, but it does not explicitly return the answers to the question.
• It notifies the user of the existence and location of documents that might contain the required information.
• Information retrieval also supports users in browsing or filtering document collections, or in processing a set of retrieved documents.
• The system searches over billions of documents stored on millions of computers.
32. IR system
• An IR system has the ability to represent, store, organize, and access information items.
• A set of keywords is required to search. Keywords are what people search for in search engines.
• These keywords summarize the description of the information.
• Email programs provide a spam filter, and manual or automatic means of classifying mail so that it can be placed directly into particular folders.
33. Any Questions?
34. P1WU
UNIT – I: INTRODUCTION
TOPIC – 2: EARLY DEVELOPMENTS
36. EARLY DEVELOPMENTS
37. EARLY DEVELOPMENTS
• For more than 5,000 years, man has organized information for later retrieval and searching.
• In its most usual form, this has been done by compiling, storing, organizing, and indexing
• clay tablets,
• hieroglyphics,
• papyrus rolls, and
• books.
38. EARLY DEVELOPMENTS
• IR in the 17th century: Samuel Pepys, the famous English diarist, subject-indexed his treasured library of 1,000+ books with key words.
40. EARLY DEVELOPMENTS
• Document Collection: the text units we have built an IR system over.
• Usually documents
• But could be
• memos
• book chapters
• paragraphs
• scenes of a movie
• turns in a conversation...
• Lots of them
41. EARLY DEVELOPMENTS
• For holding the various items, special-purpose buildings called libraries (from the Latin word liber, for book) or bibliothekes (from the Greek word biblion, for papyrus roll) are used.
• The oldest known library was created in Ebla, in the “Fertile Crescent”, currently northern Syria, some time between 3,000 and 2,500 BC.
42. EARLY DEVELOPMENTS
• In the seventh century BC, the Assyrian king Ashurbanipal created the library of Nineveh, on the Tigris River (today, northern Iraq), which contained more than 30,000 clay tablets at the time of its destruction in 612 BC.
• By 300 BC, Ptolemy Soter, a Macedonian general, created the Great Library in Alexandria, the Egyptian city at the mouth of the Nile named after the Macedonian king Alexander the Great (356-323 BC).
43. EARLY DEVELOPMENTS
• For seven centuries the Great Library, jointly with other major libraries
in the city,
• made Alexandria the intellectual capital of the Western world.
• Since then, libraries have expanded and flourished. Nowadays, they
are everywhere.
44. EARLY DEVELOPMENTS
• They constitute the collective memory of the human race, and their popularity is on the rise.
• In 2008 alone, people in the US visited their libraries some 1.3 billion times and checked out more than 2 billion items, an increase in both yearly figures of more than 10 percent.
45. EARLY DEVELOPMENTS
• Since the volume of information in libraries is always growing, it is necessary to build specialized data structures for fast search: the indexes.
• In one form or another, indexes are at the core of every modern information retrieval system.
• They provide fast access to the data and allow speeding up query processing.
46. EARLY DEVELOPMENTS
• For centuries indexes have been created manually as sets of categories.
• Each category in the index is typically composed of labels that identify its
associated topics and of pointers to the documents that discuss those
topics.
• While these indexes are usually designed by library and information science
researchers,
• the advent of modern computers has allowed the construction of large indexes
automatically,
• which has accelerated the development of the area of Information Retrieval (IR).
47. EARLY DEVELOPMENTS
• Early developments in IR date back to research efforts conducted in the 1950s
by pioneers such as
• Hans Peter Luhn,
• Eugene Garfield,
• Philip Bagley, and
• Calvin Mooers,
• this last one having allegedly coined the term information retrieval.
• In 1955, Allen Kent and colleagues published a paper describing the
precision and recall metrics,
• which was followed by the publication in 1962 of the Cranfield studies by Cyril
Cleverdon.
48. EARLY DEVELOPMENTS
• In 1963, Joseph Becker and Robert Hayes published the first book on
information retrieval [164].
• Throughout the 1960s, Gerard Salton and Karen Sparck Jones, among others, shaped the field by developing the fundamental concepts that led to the modern technologies of ranking in IR.
• In 1968, the first IR book authored by Salton was published.
49. EARLY DEVELOPMENTS
• In 1971, N. Jardine and C.J. Van Rijsbergen articulated the “cluster
hypothesis”.
• In 1978, the first ACM Conference on IR (ACM SIGIR) was held in
Rochester, New York.
• In 1979, C.J. Van Rijsbergen published Information Retrieval, which
focused on probabilistic models.
50. EARLY DEVELOPMENTS
• In 1983, Salton and McGill published Introduction to Modern
Information Retrieval, a classic book on IR focused on vector models.
• Since then,
• the IR community has grown to include
• thousands of professors,
• researchers,
• students,
• engineers, and
• practitioners
• throughout the world.
51. EARLY DEVELOPMENTS
• The main conference in the area, the ACM International Conference on Information Retrieval (ACM SIGIR), now attracts hundreds of attendees and receives hundreds of submitted papers on a yearly basis.
52. MODERN IR DEVELOPMENTS
55. Any Questions?
56. P1WU
UNIT – I: INTRODUCTION
Topic 3: THE IR PROBLEM
57. UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The Users Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of the
IR System
8. The Retrieval and Ranking
Processes
9. The Web
10. The e-Publishing Era
11. How the web changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search
Interfaces.
58. INTRODUCTION TO THE INFORMATION RETRIEVAL (IR) PROBLEM
What is the IR problem?
Users of modern IR systems, such as search engine users, have
information needs of varying complexity.
In the simplest case, they are looking for the link to the homepage of a
company, government, or institution.
In the more sophisticated cases, they are looking for information
required to execute tasks associated with their jobs or immediate
needs.
59. IR PROBLEM EXAMPLE
An example of a more complex information need is as
follows:
• Find all documents that address the role of the Federal
Government in financing the operation of the National Railroad
Transportation Corporation (AMTRAK).
• This full description of the user need does not necessarily provide
• the best formulation for querying the IR system.
60. IR PROBLEM EXAMPLE
• Instead, the user might want to first translate this
information need into
• a query, or sequence of queries, to be posed to the
system.
• In its most common form, this translation yields
• a set of keywords, or index terms, which summarize the
user information need.
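The translation from a full information need to a set of keywords can be sketched in a few lines. This is only an illustration, not a method from the slides: the stopword list and the function name `to_keywords` are assumptions made for the example, and real systems use much larger stopword lists and further text processing.

```python
import re

# Hypothetical, deliberately tiny stopword list used only for this sketch.
STOPWORDS = {"find", "all", "documents", "that", "address",
             "the", "of", "in", "a", "an", "to", "and"}


def to_keywords(information_need: str) -> list[str]:
    """Lowercase the text, split it into word tokens, drop stopwords,
    and keep the first occurrence of each remaining word in order."""
    tokens = re.findall(r"[a-z0-9]+", information_need.lower())
    seen = set()
    keywords = []
    for token in tokens:
        if token not in STOPWORDS and token not in seen:
            seen.add(token)
            keywords.append(token)
    return keywords


need = ("Find all documents that address the role of the Federal Government "
        "in financing the operation of the National Railroad Transportation "
        "Corporation (AMTRAK).")
print(to_keywords(need))
```

The resulting keyword list is one possible query; a user would typically refine it over a sequence of queries.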
61. IR PROBLEM EXAMPLE
• Given the user query, the key goal of the IR system is to retrieve information that is useful or relevant to the user.
• The emphasis is on the retrieval of information as opposed to the retrieval of data.
• To be effective in its attempt to satisfy the user information need, the IR system must somehow ‘interpret’ the contents of the information items, that is, the documents in a collection, and rank them according to a degree of relevance to the user query.
63. IR PROBLEM
• This ‘interpretation’ of a document content involves
extracting syntactic and semantic information from the
document text and using this information to match the user
information need.
64. IR PROBLEM
• The IR Problem: the primary goal of an IR system is to
• retrieve all the documents that are relevant to a user query while retrieving as
few non-relevant documents as possible.
• The difficulty is knowing not only how to extract information
from the documents
• but also knowing how to use it to decide relevance.
• That is, the notion of relevance is of central importance in IR.
65. IR PROBLEM
• One main issue is that relevance is a personal assessment that depends on the
task being solved and its context.
• For example:
• Relevance can change
• with time (e.g., new information becomes available),
• with location (e.g., the most relevant answer is the closest one), or
• even with the device (e.g., the best answer is a short document that is easier to download
and visualize).
• In this sense, no IR system can provide perfect answers to all users all the time.
66. Any Questions?
67. P1WU
UNIT – I: INTRODUCTION
Topic 4: THE USER'S TASK
69. AN OVERVIEW OF USER TASK : What is USER TASK IN IRS?
• The user of a retrieval system has to :
• translate his information need into a query in the language provided by the system.
• With an information retrieval system, this normally implies
• specifying a set of words which convey the semantics of the information need.
• With a data retrieval system,
• a query expression (such as, for instance, a regular expression) is used to
• convey the constraints that must be satisfied by objects in the answer set.
• In both cases, we say that the user searches for useful information executing
a retrieval task.
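The difference between the two kinds of query can be illustrated with a small sketch. The document texts, ids, and function names below are made up for the example; the keyword matcher is a deliberately naive stand-in for an IR system, not a real ranking method.

```python
import re

# Hypothetical toy collection: document id -> text.
docs = {
    1: "Formula 1 racing results for 2009",
    2: "History of car racing at Le Mans",
    3: "Invoice INV-2041 issued to the racing team",
}


def keyword_search(query: str, docs: dict[int, str]) -> list[int]:
    """IR-style query: a set of words conveying the information need.
    Returns ids of documents that share at least one word with the query."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in docs.items()
            if terms & set(text.lower().split())]


def regex_search(pattern: str, docs: dict[int, str]) -> list[int]:
    """Data-retrieval-style query: a constraint every answer must satisfy.
    Returns ids of documents whose text matches the regular expression."""
    compiled = re.compile(pattern)
    return [doc_id for doc_id, text in docs.items() if compiled.search(text)]


print(keyword_search("car racing", docs))   # loosely related documents
print(regex_search(r"INV-\d{4}", docs))     # only exact pattern matches
```

The keyword query returns anything loosely about the topic, while the regular expression returns only objects that satisfy the stated constraint.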
70. IRS – USER TASK : What is USER TASK IN IRS?
• Users of modern IR systems,
• such as search engine users, have information needs of varying complexity.
• The user of a retrieval system has to translate their information need into a query in
the language provided by the system.
• With an IR system, such as a search engine,
• this usually implies specifying a set of words that convey the semantics of the
information need.
• We say that the user is searching or querying for information of their interest.
71. IRS – USER TASK EXAMPLE
To illustrate,
• The user might be interested in documents about car racing in general and might decide to glance at related documents about Formula 1 racing, Formula Indy, and the ‘24 Hours of Le Mans’.
• We say that the user is browsing or navigating the documents in the collection, not searching.
• It is still a process of retrieving information, but one whose main objectives are less clearly defined in the beginning.
72. IRS – USER TASK
• While searching for information of interest is the main
retrieval task on the Web,
• search can also be used for satisfying other user needs distinct from
information access,
• such as the buying of goods and the placing of reservations.
• Consider now a user
• who has an interest that is either poorly defined or inherently
broad,
• such that the query to specify is unclear.
73. IRS – USER TASK
• The task in this case is more related to
• exploratory search and resembles a process of quasi-sequential search for
information of interest.
• Here, we make a clear distinction between the different tasks
the user of the retrieval system might be engaged in.
74. IRS – USER TASK
• The task might be then of two distinct types: searching and
browsing, as illustrated in Figure:
75. IRS – USER TASK
• Browsing is a process of retrieving information whose main objectives are not clearly defined in the beginning and whose purpose might change during the interaction with the system.
• In that case, the user task may consist of browsing only.
76. IRS – USER TASK
• User choice of information retrieval:
• Push
• Pull
77. IRS – USER TASK
• Both retrieval and browsing are, in the language of the World Wide
Web, `pulling' actions.
• That is, the user requests the information in an interactive manner.
• An alternative is to do retrieval in an automatic and permanent
fashion using software agents which push the information towards
the user.
• For instance, information useful to a user could be extracted
periodically from a news service.
• In this case, we say that the IR system is executing a particular retrieval task which
consists of filtering relevant information for later inspection by the user.
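A push-style filtering task can be sketched as a standing user profile applied to each batch of incoming items. The profile terms, the news headlines, and the function name `filter_stream` are assumptions made for this illustration; a real filtering agent would use a richer profile and matching model.

```python
def filter_stream(profile: set[str], items: list[str]) -> list[str]:
    """Keep the incoming items that mention at least one term of the
    user's standing profile; these are set aside for later inspection."""
    selected = []
    for item in items:
        words = set(item.lower().split())
        if profile & words:
            selected.append(item)
    return selected


# Hypothetical profile and one periodic batch from a news service.
profile = {"railroad", "amtrak"}
news = [
    "Amtrak funding bill passes",
    "Local football results",
    "New railroad bridge opens",
]
inbox = filter_stream(profile, news)
print(inbox)
```

Unlike searching or browsing, the user issues no request here; the profile stays fixed while the collection of items keeps changing.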
78. Any Questions?
79. P1WU
UNIT – I: INTRODUCTION
Topic 5: INFORMATION VERSUS DATA RETRIEVAL
81. INTRODUCTION TO INFORMATION
What is Information?
82. INTRODUCTION TO RETRIEVAL
What is Retrieval?
• This does not mean that there is no structure in the data:
• Document structure (headings, paragraphs, lists...)
• Explicit markup formatting (e.g., in HTML, XML...)
• Linguistic structure (latent, hidden)
SELECT *
FROM business_catalogue
WHERE category = 'florist' AND city_zip = 'cb1'
83. INTRODUCTION TO RETRIEVAL
What is Information retrieval (IR) ?
Information retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text)
that satisfies an information need from within large
collections (usually stored on computers).
84. INTRODUCTION TO RETRIEVAL
• An information need is
• the topic about which the user desires to know more.
• A query is
• what the user conveys to the computer in an
attempt to communicate the information need.
• A document is relevant
• if the user perceives that it contains information
of value with respect to their personal
information need.
• Known-item search
• Precise information-seeking search
• Open-ended search (“topical search”)
85. INFORMATION VERSUS DATA RETRIEVAL
• Data retrieval, in the context of an IR system, consists
• mainly of determining which documents of a collection contain
• the keywords in the user query which, most frequently, is not enough to satisfy
the user information need.
• In fact, the user of an IR system is concerned more with
• retrieving information about a subject than with retrieving data that satisfies a
given query.
• For instance, a user of an IR system is willing to accept
• documents that contain synonyms of the query terms in the result set,
• even when those documents do not contain any query terms.
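The synonym point can be made concrete with a small sketch that contrasts exact keyword matching with matching after synonym expansion. The synonym table, documents, and function names are hypothetical and exist only for this example.

```python
# Hypothetical, hand-written synonym table used only for this illustration.
SYNONYMS = {"car": {"automobile", "auto"}, "film": {"movie"}}

# Hypothetical toy collection: document id -> text.
docs = {
    1: "used car prices",
    2: "used automobile prices",
    3: "bicycle prices",
}


def exact_match(query: str, docs: dict[int, str]) -> list[int]:
    """Data-retrieval view: a document qualifies only if it contains
    every keyword of the query literally."""
    terms = query.lower().split()
    return [doc_id for doc_id, text in docs.items()
            if all(t in text.lower().split() for t in terms)]


def expanded_match(query: str, docs: dict[int, str]) -> list[int]:
    """IR view: each query term may also be satisfied by one of its synonyms."""
    result = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if all(({t} | SYNONYMS.get(t, set())) & words
               for t in query.lower().split()):
            result.append(doc_id)
    return result


print(exact_match("car prices", docs))
print(expanded_match("car prices", docs))
```

Document 2 never mentions "car", yet most users asking about car prices would accept it, which is what the expanded match captures.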
86. INFORMATION RETRIEVAL
• “Ad hoc” retrieval
• Web retrieval
• Support for browsing and filtering document collections:
• Clustering
• Classification, using fixed labels (common information needs, age groups, topics)
• Further processing a set of retrieved documents, e.g., by using natural language processing:
• Information extraction
• Summarization
• Question answering
87. RETRIEVAL TYPES
1) Web search
• The search ground is billions of documents on millions of computers
• Issues: spidering; efficient indexing and search; malicious manipulation to boost search engine rankings
2) Enterprise and institutional search
• e.g., a company’s documentation, patents, research articles; often domain-specific
• Centralised storage; dedicated machines for search
• Most prevalent IR evaluation scenario: US intelligence analysts’ searches
3) Personal information retrieval (email, personal documents)
• e.g., Mac OS X Spotlight; Windows’ Instant Search
• Issues: different file types; maintenance-free, lightweight to run in background
88. INFORMATION VERSUS DATA RETRIEVAL
• In an IR system the retrieved objects might be inaccurate, and small errors are likely to go unnoticed.
• In a data retrieval system, on the contrary, a single erroneous object among a thousand retrieved objects means total failure.
89. INFORMATION VERSUS DATA RETRIEVAL
• A data retrieval system, such as a relational database, deals with data that has a well-defined structure and semantics,
• while an IR system deals with natural language text which is not well structured.
• Data retrieval, while providing a solution to the user of a database system, does not solve the problem of retrieving information about a subject or topic.
90. THE INFORMATION VERSUS RETRIEVAL
S.No. | Data | Information
1 | Unorganized raw facts that need processing, without which they are seemingly random and useless to humans. | Processed, organized data presented in a given context and useful to humans.
2 | An individual unit that contains raw material and does not carry any specific meaning. | A group of data that collectively carries a logical meaning.
3 | Data does not depend on information. | Information depends on data.
4 | Measured in bits and bytes. | Measured in meaningful units like time, quantity, etc.
5 | Data is never suited to the specific needs of a designer. | Information is specific to the expectations and requirements, because all the irrelevant facts and figures are removed during the transformation process.
6 | An example of data is a student's score. | The average score of a class is the information derived from the given data.
91. Any Questions?
92. P1WU
UNIT – I: INTRODUCTION
Topic 6: THE IR SYSTEM
94. The role of an IR system
A modern view:
• Support the user in
– exploring a problem domain, understanding its
terminology, concepts and structure
– clarifying, refining and formulating an information need
– finding documents that match the info need description
• As many relevant docs as possible
• As few non-relevant documents as possible
95. How does it do this ?
• User interfaces and visualization tools for
– exploring a collection of documents
– exploring search results
• Query expansion based on
– Thesauri
– Lexical/statistical analysis of text / context and concept formation
– Relevance feedback
• Indexing and matching model
96. How well does it do this?
• Evaluation
– Of the components
• Indexing / matching algorithms
– Of the exploratory process overall
• Usability issues
• Usefulness to task
• User satisfaction
97. What do we want from an IRS ?
• Systemic approach
– Goal (for a known information need):
• Return as many relevant documents as possible and as few
non-relevant documents as possible
• Cognitive approach
– Goal (in an interactive information-seeking environment,
with a given IRS):
• Support the user’s exploration of the problem domain and the task completion.
98. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
OIRS is “the techniques of storing and recovering and
often disseminating recorded data especially through
the use of a computerized system”
(Merriam-Webster Dictionary)
99. QUALITIES OF IRS
• The effectiveness of an IR system (i.e., the quality of its search results) is
determined by two key statistics about the system’s returned results for a query:
1. Precision: What fraction of the returned results are relevant to the
information need?
2. Recall: What fraction of the relevant documents in the collection were
returned by the system?
• Questions to be addressed:
• What is the best balance between the two?
• Easy to get perfect recall: just retrieve everything
• Easy to get good precision: retrieve only the most relevant
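The two statistics can be computed directly from the set of returned documents and the set of relevant documents. The sketch below assumes document ids are integers and that relevance judgments are already known; the function name and the example sets are illustrative only.

```python
def precision_recall(retrieved: set[int], relevant: set[int]) -> tuple[float, float]:
    """Precision = relevant retrieved / all retrieved.
    Recall    = relevant retrieved / all relevant.
    Empty denominators are reported as 0.0 by convention in this sketch."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection of 10 documents, 4 of which are relevant.
relevant = {1, 2, 3, 4}

# A selective system: 3 results, 2 of them relevant.
print(precision_recall({1, 2, 5}, relevant))

# "Retrieve everything": perfect recall, poor precision.
print(precision_recall(set(range(1, 11)), relevant))
```

The second call shows why perfect recall is easy to reach and why the balance between the two measures matters.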
100. THE SOFTWARE ARCHITECTURE OF THE IR SYSTEM
• To describe the IR system, we use a simple and generic software architecture as shown in Figure
101. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• The first step in setting up an IR system is to assemble the document collection, which can be private or be crawled from the Web. In the second case, a crawler module is responsible for collecting the documents.
• The document collection is stored in disk storage usually referred to as the central repository.
102. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• The documents in the central repository need to be
indexed for fast retrieval and ranking.
• The most used index structure is an inverted index
composed of all the distinct words of the collection
and, for each word,
• a list of the documents that contain it.
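A minimal inverted index along these lines can be sketched as a mapping from each distinct word to the sorted list of documents that contain it. The toy documents and the whitespace tokenization are simplifying assumptions; production indexes also store positions, frequencies, and compressed postings.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map every distinct word of the collection to the sorted list of
    ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


# Hypothetical toy collection: document id -> text.
docs = {
    1: "new home sales",
    2: "home sales rise",
    3: "new prices",
}
index = build_inverted_index(docs)
print(index["home"])
print(index["new"])
```

Looking up a word is then a single dictionary access instead of a scan over every document.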
103. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• Given that the document collection is indexed, the
retrieval process can be initiated.
• It consists of retrieving documents that satisfy either
a user query or a click on a hyperlink.
• In the first case, we say that the user is searching for
information of interest;
• in the second case, we say that the user is browsing for
information of interest.
104. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• We consider retrieval as a process, first as it applies to searching, and then how browsing compares to searching.
• To search, the user first specifies a query that reflects their
information need.
• Next, the user query is parsed and expanded with, for
instance, spelling variants of a query word.
105. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• The expanded query,
• which we refer to as the system query, is then processed against the index to
retrieve a subset of all documents.
• Next, the retrieved documents are ranked and the top
documents are returned to the user.
• The purpose of ranking is to identify the documents that are
most likely to be considered relevant by the user, and
constitutes the most critical part of the IR system.
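The two steps above, retrieval against the index followed by ranking, can be sketched as below. The scoring rule (counting occurrences of query terms) is an assumption chosen for brevity; the slides do not prescribe a particular ranking function, and the documents and index shown are hypothetical.

```python
from collections import Counter

def retrieve(index, query_terms):
    """Return ids of documents that contain at least one query term."""
    hits = set()
    for term in query_terms:
        hits.update(index.get(term, []))
    return hits

def rank(docs, doc_ids, query_terms, k=2):
    """Order retrieved documents by how often the query terms occur in them."""
    def score(doc_id):
        counts = Counter(docs[doc_id].lower().split())
        return sum(counts[t] for t in query_terms)
    # Sort by descending score, breaking ties by document id; keep the top k.
    return sorted(doc_ids, key=lambda d: (-score(d), d))[:k]

# Hypothetical collection and its inverted index.
docs = {
    1: "web search and web retrieval",
    2: "database systems",
    3: "retrieval models",
}
index = {
    "web": [1], "search": [1], "and": [1], "retrieval": [1, 3],
    "database": [2], "systems": [2], "models": [3],
}
query = ["web", "retrieval"]
retrieved = retrieve(index, query)
print(rank(docs, retrieved, query))
```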
106. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• Given the inherent subjectivity in deciding relevance, evaluating the quality of the answer set is a key step for improving the IR system.
• A systematic evaluation process allows fine-tuning the ranking algorithm and improving the quality of the results.
• The most common evaluation procedure consists of comparing the set of results produced by the IR system with results suggested by human specialists.
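One way to make the comparison with human specialists concrete is to compute precision and recall. The slides do not name a specific measure, so these two standard IR measures, the function name, and the sample data are assumptions made for illustration.

```python
def precision_recall(system_results, expert_results):
    """Compare the system's result set with the set suggested by human specialists."""
    system, expert = set(system_results), set(expert_results)
    agreed = system & expert
    # Precision: fraction of returned documents that the experts also chose.
    precision = len(agreed) / len(system) if system else 0.0
    # Recall: fraction of the experts' documents that the system returned.
    recall = len(agreed) / len(expert) if expert else 0.0
    return precision, recall

# Hypothetical document ids.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 5])
print(precision, recall)
```

Tracking these values across versions of the ranking algorithm gives a simple, repeatable basis for fine-tuning.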
108. P1WU
UNIT – I: INTRODUCTION
Topic 7: THE SOFTWARE ARCHITECTURE OF
THE IR SYSTEM
109. UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The User's Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of the IR System
8. The Retrieval and Ranking Processes
9. The Web
10. The e-Publishing Era
11. How the Web Changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search Interfaces
110. SYSTEM ARCHITECTURE OF IRS
A high-level view of the software architecture of an IR system will provide:
1) Components
2) Tools
3) Environment
4) Data source(s)
as well as the additional elements needed for a thorough understanding of the data flow.
111. THE SOFTWARE ARCHITECTURE OF THE IR SYSTEM
• To describe the IR system, we use a simple and generic software architecture as shown in Figure
113. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• The first step in setting up an IR system is to assemble the document collection, which can be private or be crawled from the Web. In the second case, a crawler module is responsible for collecting the documents.
• The document collection is stored in disk storage usually referred to as the central repository.
119. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• To improve the ranking, we might collect feedback from the users and use this information to change the results.
• On the Web, the most abundant form of user feedback is the clicks on the documents in the result set.
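A minimal sketch of how click feedback might change the results is shown below. The additive bonus per click, the `weight` parameter, and the sample scores are assumptions; the slides state only that clicks can be used, not how.

```python
def rerank_with_clicks(ranked, base_scores, clicks, weight=0.1):
    """Re-order results by base score plus a small bonus per recorded click."""
    def adjusted(doc_id):
        return base_scores[doc_id] + weight * clicks.get(doc_id, 0)
    # Sort by descending adjusted score, breaking ties by document id.
    return sorted(ranked, key=lambda d: (-adjusted(d), d))

# Hypothetical scores from the ranking function and click counts from a log.
base_scores = {"a": 1.0, "b": 0.9, "c": 0.5}
clicks = {"b": 5, "c": 1}
reranked = rerank_with_clicks(["a", "b", "c"], base_scores, clicks)
print(reranked)
```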
120. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• Another important source of information for Web ranking is the hyperlinks among pages, which allow identifying sites of high authority.
• Many other concepts and technologies bear on the design of a full-fledged IR system, such as a modern search engine.
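The idea of using hyperlinks to identify sites of high authority can be sketched by counting incoming links. This is a deliberate simplification: counting in-links is an assumption made for illustration, real search engines use more elaborate link-analysis algorithms, and the link graph below is hypothetical.

```python
from collections import Counter

def inlink_counts(links):
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts

# Hypothetical link graph: page -> pages it links to.
links = {
    "a.example": ["hub.example"],
    "b.example": ["hub.example", "a.example"],
    "hub.example": ["a.example"],
    "c.example": ["hub.example"],
}
authority = inlink_counts(links)
print(authority.most_common(1))
```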
122. P1WU
UNIT – I: INTRODUCTION
Topic 8: THE RETRIEVAL AND RANKING
PROCESSES
124. Ranked Retrieval
• Ranked retrieval
• The documents are ranked based on their score
• Advantages
– Query easy to specify
– The output is ranked based on the estimated relevance of the
documents to the query
– A wide variety of theoretical models exist
• Disadvantages
– Query less precise (although weighting can be used)
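As one example of score-based ranking, the sketch below scores each document by a sum of term frequency times inverse document frequency (tf-idf) over the query terms. The choice of tf-idf is an assumption; the slide says only that documents are ranked by a score and that many theoretical models exist. The documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_scores(docs, query_terms):
    """Score each document as the sum of tf * idf over the query terms."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            # Document frequency: number of documents containing the term.
            df = sum(1 for toks in tokenized.values() if term in toks)
            if df:
                score += counts[term] * math.log(n_docs / df)
        scores[doc_id] = score
    return scores

docs = {
    "d1": "ranking ranking models",
    "d2": "boolean models",
    "d3": "web ranking",
}
scores = tf_idf_scores(docs, ["ranking", "models"])
ranking = sorted(scores, key=lambda d: (-scores[d], d))
print(ranking)
```

The user only types the terms; the weighting is what compensates for the query being less precise than a Boolean expression.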
125. THE RETRIEVAL AND RANKING PROCESSES
• To describe the retrieval and ranking processes, we further elaborate on our description of the modules shown in Figure 1.2, as illustrated in Figure 1.3.
• Given the documents of the collection, we first apply text operations to them, such as:
1. eliminating stop words,
2. stemming, and
3. selecting a subset of all terms for use as indexing terms.
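The text operations listed above can be sketched as a small pipeline. The stop-word list and the suffix-stripping rules are hypothetical and far cruder than a real stemmer; they are assumptions used only to show the order of the steps.

```python
# Hypothetical stop-word list and suffix rules, for illustration only.
STOP_WORDS = {"the", "of", "and", "a", "is", "are", "for", "to", "in"}
SUFFIXES = ("ing", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Tokenize, drop stop words, stem, and return the distinct index terms."""
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in STOP_WORDS]
    return sorted({naive_stem(t) for t in kept})

terms = index_terms("The indexing of documents and the ranking of results")
print(terms)
```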
126. THE RETRIEVAL AND RANKING PROCESSES
• To describe the IR system Retrieval and Ranking Processes, we illustrate it through the figure
as given below:
128. THE RETRIEVAL AND RANKING PROCESSES
• The indexing terms are then used to compose document representations, which might be smaller than the documents themselves (depending on the subset of index terms selected).
• Given the document representations, it is necessary to build an index of the text.
129. THE RETRIEVAL AND RANKING PROCESSES
• Different index structures might be used, but the most popular one is an inverted index.
• The steps required to generate the index compose the indexing process and must be executed offline, before the system is ready to process any queries.
• The resources (time and storage space) spent on the indexing process are amortized by querying the retrieval system many times.
• Once the document collection is indexed, the retrieval process can be initiated.
130. THE RETRIEVAL AND RANKING PROCESSES
• The user first specifies a query that reflects their information need.
• This query is then parsed and modified by operations that resemble those applied to the documents.
• Typical operations at this point consist of spelling corrections and elimination of terms such as stop words, whenever appropriate.
131. THE RETRIEVAL AND RANKING PROCESSES
• Next, the transformed query is expanded and modified. For instance, the query might be modified using query suggestions made by the system and confirmed by the user.
• The expanded and modified query is then processed to obtain the set of retrieved documents, which is composed of documents that contain the query terms.
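Obtaining the documents that contain the query terms can be sketched as an intersection of sorted posting lists. Reading "contain the query terms" as "contain every query term" is an assumption, and the index contents are hypothetical.

```python
def intersect(postings_a, postings_b):
    """Merge two sorted posting lists, keeping ids present in both."""
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

def retrieve_all_terms(index, query_terms):
    """Return ids of documents that contain every query term."""
    lists = [index.get(term, []) for term in query_terms]
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result

# Hypothetical inverted index with sorted posting lists.
index = {"web": [1, 2, 4, 7], "search": [2, 3, 4, 9], "engine": [4, 7, 9]}
print(retrieve_all_terms(index, ["web", "search"]))
print(retrieve_all_terms(index, ["web", "search", "engine"]))
```

Because both lists are sorted, each intersection takes a single pass, which is one reason the index makes query processing fast.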
132. THE RETRIEVAL AND RANKING PROCESSES
• Fast query processing is made possible by the index structure previously built.
• The steps required to produce the set of retrieved documents constitute the retrieval process.
• Next, the retrieved documents are ranked according to their likelihood of relevance to the user.
• This is a most critical step because the quality of the results, as perceived by the users, is fundamentally dependent on the ranking.
133. THE RETRIEVAL AND RANKING PROCESSES
• The top-ranked documents are then formatted for presentation to the user.
• The formatting consists of retrieving the titles of the documents and generating snippets for them, i.e., text excerpts that contain the query terms, which are then displayed to the user.
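Snippet generation can be sketched as extracting a window of words around the first occurrence of a query term. The window size, the punctuation handling, and the sample text are assumptions; production systems select and highlight excerpts in more sophisticated ways.

```python
def make_snippet(text, query_terms, window=3):
    """Return an excerpt of `window` words on each side of the first query term."""
    words = text.split()
    terms = {t.lower() for t in query_terms}
    for pos, word in enumerate(words):
        if word.lower().strip(".,;:!?") in terms:
            start = max(0, pos - window)
            end = min(len(words), pos + window + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    # No query term found: fall back to the opening words.
    return " ".join(words[: 2 * window + 1])

text = "The crawler collects pages from the Web and stores them in the central repository."
snippet = make_snippet(text, ["web"])
print(snippet)
```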
137. P1WU
UNIT – I: INTRODUCTION
Topic 9: THE WEB
139. WEB – What is Web?
• The Web, or World Wide Web (W3), is basically a
system of Internet servers that support specially
formatted documents.
• The documents are formatted in a markup
language called HTML (HyperText Markup
Language) that supports links to other
documents, as well as graphics, audio, and video
files.
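To illustrate how HTML encodes links to other documents, the sketch below uses Python's standard `html.parser` module to collect the `href` targets of anchor tags. The page content and the `LinkExtractor` class name are hypothetical.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page content.
page = (
    '<html><body><p>See <a href="http://example.org/a.html">A</a> '
    'and <a href="b.html">B</a>.</p><img src="logo.png"></body></html>'
)
parser = LinkExtractor()
parser.feed(page)
print(parser.links)
```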
140. WEB – What Does Web Mean?
• The Web is the common name for the World Wide
Web, a subset of the Internet consisting of the pages
that can be accessed by a Web browser.
• Many people assume that the Web is the same as the
Internet, and use these terms interchangeably.
• However, the term Internet actually refers to the global network of
servers that makes the information sharing that happens over the
Web possible.
• So, although the Web makes up a large portion of the Internet, the two are not one and the same.
141. WEB – KEY TOPICS
• WEB SITE OR WEB PAGES
• WEB SERVER
• WEB CLIENT
• WEB APPLICATIONS OR WEB APPs
• WEB SOFTWARE – WEB 1.0, WEB 2.0, WEB 3.0
• WEB SERVICES
142. WEB - A Brief History
• “As We May Think” influenced people like Douglas Engelbart who, at the Fall Joint Computer Conference in San Francisco in December 1968, ran a demonstration in which he introduced the first-ever computer mouse, video conferencing, teleconferencing, and hypertext.
• It was so incredible that it became known as “the mother of all demos” [1690].
• Of the innovations displayed, the one that interests us the most here is hypertext.
• The term was coined by Ted Nelson in his Project Xanadu.
143. WEB - A Brief History
• Hypertext allows the reader to jump from one electronic document to another, which was an important property regarding the problem Tim Berners-Lee faced in 1989.
• At the time, Berners-Lee worked in Geneva at CERN (Conseil Européen pour la Recherche Nucléaire).
144. WEB - A Brief History
• Researchers who wanted to share documentation with others had to reformat their documents to make them compatible with an internal publishing system.
• It was annoying and generated many questions, many of which ended up being directed towards Berners-Lee.
• He understood that a better solution was required.
145. WEB - A Brief History
• It just so happened that CERN was the largest Internet node in Europe.
• Berners-Lee reasoned that it would be nice if the solution to the problem of sharing documents were decentralized, such that the researchers could share their contributions freely.
• He saw that a networked hypertext, through the Internet, would be a good solution and started working on its implementation.
146. WEB - A Brief History
• In 1990, he wrote the HTTP protocol, defined the HTML language, and wrote the first browser, which he called “World Wide Web”, and the first Web server.
• In 1991, he made his browser and server software available on the Internet. The Web was born.
147. Web Information Retrieval : What is web in information retrieval?
• Web information retrieval is the process of searching a huge World Wide Web document collection to satisfy a particular information need (expressed as a query).
148. Web Information Retrieval : What is web in information retrieval?
• The Web can be considered as a large-scale document collection, to which classical text retrieval techniques can be applied.
• However, its unique features and structure offer new sources of evidence that can be used to enhance the effectiveness of Information Retrieval (IR) systems.
• Generally, Web IR examines the combination of evidence from both the textual content of documents and the structure of the Web, as well as the search behavior of users and issues related to the evaluation of retrieval effectiveness in the Web setting.
149. Web Information Retrieval
• Web documents / data
• No traditional collection
– Huge
• Time and space to crawl and index
• IRSs cannot store copies of documents
– Dynamic, volatile, anarchic, uncontrolled
150. Web Information Retrieval
– Homogeneous sub-collections
• Structure
– In documents (un-/semi-/fully-structured)
– Between docs: network of inter-connected nodes
– Hyper-links - conceptual vs. physical documents
151. Web Information Retrieval
• Web Information Retrieval models are ways of integrating many sources of evidence about documents, such as
1. the links,
2. the structure of the document,
3. the actual content of the document,
4. the quality of the document, etc.,
so that an effective Web search engine can be achieved.
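One simple way to integrate several sources of evidence is a weighted sum of per-source scores. The linear form, the weights, and the sample values below are assumptions chosen for illustration; the slides do not specify how the evidence is combined.

```python
# Hypothetical weights; in practice they would be tuned on evaluation data.
WEIGHTS = {"content": 0.5, "links": 0.3, "structure": 0.1, "quality": 0.1}

def combined_score(evidence, weights=WEIGHTS):
    """Weighted sum of per-source evidence scores, each assumed to lie in [0, 1]."""
    return sum(weights[name] * evidence.get(name, 0.0) for name in weights)

# Hypothetical evidence scores for two pages.
pages = {
    "p1": {"content": 0.9, "links": 0.2, "structure": 0.5, "quality": 0.5},
    "p2": {"content": 0.6, "links": 0.9, "structure": 0.5, "quality": 0.8},
}
ranking = sorted(pages, key=lambda p: -combined_score(pages[p]))
print(ranking)
```

In this example the page with weaker content evidence still ranks first because its link and quality evidence is stronger.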
• In contrast with the traditional library-type settings of IR systems,
• the Web is a hostile environment, where Web search engines have to deal with
153. P1WU
UNIT – I: INTRODUCTION
Topic 10: THE E-PUBLISHING ERA
155. What is E-Publishing?
• Electronic publishing (also referred to as e-publishing, digital publishing, or online publishing) includes the digital publication of
• e-books,
• digital magazines, and
• the development of digital libraries and catalogues.
• It also includes the editing of
• books,
• journals, and
• magazines
to be posted on a screen (computer, e-reader, tablet, or smartphone).
156. Need of E-Publishing
• Since its inception, the Web has been a huge success.
• The number of Web pages now far exceeds 20 billion, and the number of Web users in the world exceeds 1.7 billion.
• It is known that there are more than one trillion distinct URLs on the Web, even if many of them are pointers to dynamic pages, not static HTML pages.
• A viable model of economic sustainability based on online advertising was developed.
157. THE E-PUBLISHING ERA
• Electronic publishing has become common in scientific publishing, where it has been argued that peer-reviewed scientific journals are in the process of being replaced by electronic publishing.
• It is also becoming common to distribute books, magazines, and newspapers to consumers through tablet reading devices, a market that is growing by millions each year, generated by online vendors such as Apple's iTunes bookstore, Amazon's bookstore for Kindle, and the Google Play Bookstore.
• Market research suggested that half of all magazine and newspaper circulation is based on e-publishing.
158. THE E-PUBLISHING ERA
• The advent of the Web changed the world in a way that few people could have anticipated.
• Yet, one has to wonder about the characteristics of the Web that have made it so successful.
• Is there a single characteristic of the Web that was most decisive for its success?
• Candidates include the simple HTML markup language, the low access costs, the widespread reach of the Internet, the interactive browser interface, and the search engines.
• While providing the fundamental infrastructure for the Web, these technologies were not the root cause of its popularity.
159. THE E-PUBLISHING ERA – Its Time of Birth
• What was it then?
• The fundamental shift in human relationships, introduced by the Web, was freedom to publish.
• Example:
• Jane Austen is one of the most famous writers in English literature. Her books are read
by people all over the world and have been made into countless TV, film, theatre and
radio adaptations.
• This is all the more impressive because she only wrote six full-length novels.
• Jane Austen did not have that freedom, so she had to either convince a publisher of the quality of her work or pay for the publication of an edition of it herself.
• Since she could not pay for it, she had to be patient and wait for the publisher to become convinced.
• It took 15 years.
160. THE E-PUBLISHING ERA – Its Time of Birth
• In the world of the Web, this is no longer the case.
• People can now publish their ideas on the Web and reach millions of people overnight, without paying anything for it and without having to convince the editorial board of a large publishing company.
• Restrictions imposed by mass communication media companies and by natural geographical barriers were almost entirely removed by the invention of the Web, which has led to a freedom to publish that marks the birth of a new era, one which we refer to as the e-Publishing Era.
162. P1WU
UNIT – I: INTRODUCTION
Topic 11: HOW THE WEB CHANGED SEARCH
164. HOW THE WEB CHANGED SEARCH
• Web search is today the most prominent application of IR and its techniques.
• Indeed, the ranking and indexing components of any search engine are fundamentally IR pieces of technology.
• An immediate consequence is that the Web has had a major impact on the development of IR.
165. HOW THE WEB CHANGED SEARCH
1) The first major impact of the Web on search is related to the characteristics of the document collection itself.
• The Web collection is composed of documents (or pages) distributed over millions of sites and connected through hyperlinks, i.e., links that associate a piece of text of a page with other Web pages.
• The inherently distributed nature of the Web collection requires collecting all documents and storing copies of them in a central repository, prior to indexing.
• This new phase in the IR process, introduced by the Web, is called crawling.
• The system that implements the IR process(es) is called a search engine or IR system.
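The step of collecting documents by following hyperlinks and storing copies in a central repository can be sketched as a breadth-first traversal. To keep the example self-contained and runnable offline, the "Web" here is a hypothetical in-memory dictionary, and `fetch_links` stands in for downloading a page and extracting its links; both are assumptions.

```python
from collections import deque

def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first traversal that collects pages reachable from the seed."""
    repository = []   # stands in for the central repository
    seen = {seed}
    queue = deque([seed])
    while queue and len(repository) < max_pages:
        url = queue.popleft()
        repository.append(url)
        for link in fetch_links(url):
            if link not in seen:  # visit each page at most once
                seen.add(link)
                queue.append(link)
    return repository

# Hypothetical in-memory web: page -> outgoing hyperlinks.
WEB = {
    "site-a/home": ["site-a/about", "site-b/home"],
    "site-a/about": ["site-a/home"],
    "site-b/home": ["site-c/home"],
    "site-c/home": [],
}
collected = crawl("site-a/home", lambda url: WEB.get(url, []))
print(collected)
```

Only after this collection step finishes (or reaches its page limit) can the stored copies be indexed.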
166. HOW THE WEB CHANGED SEARCH
2) The second major impact of the Web on search is related to the size of the collection and the volume of user queries submitted on a daily basis.
• Given that the Web grew larger and faster than any previously known text collection, search engines now have to handle a volume of text that far exceeds 20 billion pages, i.e., a volume of text much larger than any previous text collection.
• The volume of user queries is also much larger than ever before, even if estimates vary widely.
• The combination of a very large text collection with very high query traffic has pushed the performance and scalability of search engines to limits that largely exceed those of any previous IR system.
167. HOW THE WEB CHANGED SEARCH
• That is, performance and scalability have become critical characteristics of the IR system, much more than they used to be prior to the Web.
• We do not discuss the performance and scalability of search engines here.
3) The third major impact of the Web on search is also related to the vast size of the document collection.
• In a very large collection, predicting relevance is much harder than before.
• Basically, any query retrieves a large number of documents that match its terms, which means that there are many noisy documents in the set of retrieved documents.
• That is, documents are retrieved that seem related to the query but are actually not relevant to it according to the judgment of a large fraction of the users.
168. HOW THE WEB CHANGED SEARCH
• This problem first showed up in the early Web search engines and became more severe as the Web grew.
• Fortunately, the Web also includes new sources of evidence not present in standard document collections that can be used to alleviate the problem, such as hyperlinks and user clicks in documents in the answer set.
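As a minimal sketch of how hyperlinks might serve as extra evidence, the example below orders pages that already match a query by how many other pages link to them. The link graph, page names, and the use of a raw in-link count are all assumptions made for illustration; this is a deliberate simplification and not the method of any particular search engine.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
    "d": ["c", "b"],
}

def inlink_counts(links):
    """Count, for each page, how many other pages link to it."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in targets:
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts

def rerank(matching_pages, links):
    """Order pages that matched the query by in-link count, highest first."""
    counts = inlink_counts(links)
    return sorted(matching_pages, key=lambda p: (-counts[p], p))

ranked = rerank(["a", "b", "c"], LINKS)
print(ranked)
```

Click data could be folded in the same way, as another per-page count used alongside the term match.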
• Two other major impacts of the Web on search derive from the fact that the Web is not just a repository of documents and data, but also a medium to do business.
• One immediate implication is that the search problem has been extended beyond the seeking of text information to also encompass other user needs, such as the price of a book, the phone number of a hotel, or the link for downloading a piece of software.
• Providing effective answers to these types of information needs frequently requires identifying structured data associated with the object of interest, such as price, location, or descriptions of some of its key characteristics.
169. HOW THE WEB CHANGED SEARCH
The fifth and final impact of the Web on search derives from Web
advertising and other economic incentives.
The continued success of the Web as an interactive medium for the
masses created incentives for its economic exploration in the
form of, for instance, advertising and electronic commerce.
These incentives also led to the abusive availability of commercial
information disguised in the form of purely informational
content, which is usually referred to as Web spam.
170. HOW THE WEB CHANGED SEARCH
The increasingly pervasive presence of spam on the Web has made the quest for
relevance even more difficult than before,
i.e., spam content is sometimes so compelling that it is confused with truly
relevant content.
Because of that, it is not unreasonable to think that spam makes relevance
negative,
i.e., the presence of spam makes the current ranking algorithms produce
answer sets that are worse than they would be if the Web were spam
free.
This difficulty is so large that today we talk of Adversarial Web Retrieval.
172. P1WU
UNIT – I: INTRODUCTION
Topic 12: PRACTICAL ISSUES ON THE WEB
173. UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The User's Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of the IR System
8. The Retrieval and Ranking Processes
9. The Web
10. The e-Publishing Era
11. How the Web Changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search Interfaces
174. PRACTICAL ISSUES ON THE WEB
• Electronic commerce is a major trend on the
Web nowadays and one which has benefited
millions of people.
• In an electronic transaction, the buyer usually
submits to the vendor credit information to be
used for charging purposes.
175. PRACTICAL ISSUES ON THE WEB
• In its most common form, such information consists of
a credit card number.
• For security reasons, this information is usually
encrypted, as done by institutions and companies that
deploy automatic authentication processes.
• Besides security, another issue of major interest is
privacy. Frequently, people are willing to exchange
information as long as it does not become public.
176. PRACTICAL ISSUES ON THE WEB
• The reasons are many, but the most common one is to protect oneself against misuse of private information by third
parties.
• Privacy is another issue which affects the deployment of the Web and which has not
been properly addressed yet.
• Two other important issues are
1. copyright and
2. patent rights.
• It is far from clear how the wide spread of data on the Web affects copyright and
patent laws in the various countries.
177. PRACTICAL ISSUES ON THE WEB
• This is important because it affects the business of building up and
deploying large digital libraries.
• For instance, is a site which supervises all the information it posts acting as a publisher?
• And if so, is it responsible for misuse of the information it posts (even if it
is not the source)?
• Additionally, other practical issues of interest include scanning, optical character recognition (OCR), and cross-language retrieval
(in which the query is in one language but the documents retrieved are in
another language).
179. P1WU
UNIT – I: INTRODUCTION
Topic 13: HOW PEOPLE SEARCH
181. HOW PEOPLE SEARCH?
• Search tasks range from the relatively simple (e.g., looking up disputed facts or finding weather
information) to the rich and complex (e.g., job seeking and planning vacations).
• Search interfaces should support a range of tasks, while taking into account how people think about searching
for information.
182. Information Lookup versus Exploratory Search
• User interaction with search interfaces differs depending on
a) the type of task,
b) the amount of time and effort available to invest in the process, and
c) the domain expertise of the information seeker.
• The simple interaction dialogue used in Web search engines is
most appropriate for:
1. finding answers to questions, or
2. finding Web sites or other resources that act as search starting
points.
183. HOW PEOPLE SEARCH?
• But, as Marchionini notes, the “turn-taking” interface of Web search engines is
inherently limited and in many cases is being supplanted by
specialty search engines (such as those for travel and health information) that offer richer
interaction models.
184. Classic versus Dynamic Model of Information Seeking
• Researchers have developed numerous theoretical models of
how people go about doing search tasks.
• The classic notion of the information seeking process model, as
described by Sutcliffe and Ennis, is formulated as a cycle consisting of four main
activities:
1. problem identification,
2. articulation of information need(s),
3. query formulation, and
4. results evaluation.
185. Classic versus Dynamic Model of Information Seeking
• The standard model of the information seeking process contains
an underlying assumption that the user’s information need is static and the
information seeking process is one of successively refining a query until
all and only those documents relevant to the original information need have been
retrieved.
• More recent models emphasize the dynamic nature of the search process,
noting that users learn as they search, and their information needs adjust as they
see retrieval results and other document surrogates.
• This dynamic process is sometimes referred to as the berry-picking model
of search.
187. P1WU
UNIT – I: INTRODUCTION
Topic 14: SEARCH INTERFACES TODAY