SlideShare a Scribd company logo
1 of 34
Download to read offline
Modern Information Retrieval
Chapter 1
Introduction
Information Retrieval
The IR Problem
The IR System
The Web
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 1
Information Retrieval (IR)
IR deals with the representation, storage, organization
of, and access to information items
Types of information items: documents, Web pages, online
catalogs, structured records, multimedia objects
Early goals of the IR area: indexing text and searching
for useful documents in a collection
Nowadays, research in IR includes:
Modeling, Web search, text classification, systems architecture,
user interfaces, data visualization, filtering and languages
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 2
Early Developments
For more than 5,000 years, man has organized
information for later retrieval and searching
This has been done by compiling, storing, organizing, and
indexing papyrus, hieroglyphics, and books
For holding the various items, special purpose buildings
called libraries, or bibliothekes, are used
The oldest known library was created in Elba, in the Fertile
Crescent, between 3,000 and 2,500 BC
By 300 BC, Ptolemy Soter, a Macedonian general, created the
Great Library at Alexandria
Nowadays, libraries are everywhere
In 2008, more than 2 billion items were checked out from
libraries in the US—an increase of 10% over the previous year
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 3
Early Developments
Since the volume of information in libraries is always
growing, it is necessary to build specialized data
structures for fast search — the indexes
For centuries indexes have been created manually as
sets of categories, with labels associated with each
category
The advent of modern computers has allowed the
construction of large indexes automatically
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 4
Early Developments in IR
During the 50’s, research efforts in IR were initiated by
pioneers such as Hans Peter Luhn, Eugene Garfield,
Philip Bagley, and Calvin Moores, who allegedly coined
the term Information Retrieval
In 1962, Cyril Cleverdon published the Cranfield studies
on retrieval evaluation
In 1963, Joseph Becker and Robert Hayes published
the first book on IR
In the late 60’s, key research conducted by Karen
Sparck Jones and Gerard Salton, among others, led to
the definition of the TF-IDF term weighting scheme
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 5
Early Developments in IR
In 1971, Jardine and van Rijsbergen articulated the
cluster hypothesis
In 1978, the first ACM SIGIR Internation Conference on
Information Retrieval was held in Rochester
In 1979, van Rijsbergen published a classic book
entitled Information Retrieval, which focused on the
Probabilistic Model
In 1983, Salton and McGill published a classic book
entitled Introduction to Modern Information Retrieval,
which focused on the Vector Model
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 6
Libraries and Digital Libraries
Libraries were among the first institutions to adopt IR
systems for retrieving information
Initially, such systems consisted of an automation of
existing processes such as card catalogs searching
Increased search functionality was then added
Ex: subject headings, keywords, query operators
Nowadays, the focus has been on improved graphical
interfaces, electronic forms, hypertext features
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 7
IR at the Center of the Stage
Until recently, IR was an area of interest restricted
mainly to librarians and information experts
A single fact changed these perceptions—the
introduction of the Web, which has become the largest
repository of knowledge in human history
Due to its enormous size, finding useful information on
the Web usually requires running a search
And searching on the Web is all about IR and its
technologies
Thus, almost overnight, IR has gained a
place with other technologies at the
center of the stage
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 8
The IR Problem
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 9
The IR Problem
Users of modern IR systems, such as search engine
users, have information needs of varying complexity
An example of complex information need is as follows:
Find all documents that address the role
of the Federal Government in financing
the operation of the National Railroad
Transportation Corporation (AMTRAK)
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 10
The IR Problem
This full description of the user information need is not
necessarily a good query to be submitted to the IR
system
Instead, the user might want to first translate this
information need into a query
This translation process yields a set of keywords, or
index terms, which summarize the user information
need
Given the user query, the key goal of the IR system is to
retrieve information that is useful or relevant to the user
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 11
The IR Problem
That is, the IR system must rank the information items
according to a degree of relevance to the user query
The IR Problem
The key goal of an IR system is to retrieve all the
items that are relevant to a user query, while
retrieving as few nonrelevant items as possible
The notion of relevance is of central importance in IR
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 12
The User’s Task
Consider a user who seeks information on a topic of
their interest
This user first translates their information need into a query,
which requires specifying the words that compose the query
In this case, we say that the user is searching or querying for
information of their interest
Consider now a user who has an interest that is either
poorly defined or inherently broad
For instance, the user has an interest in car racing and wants to
browse documents on Formula 1 and Formula Indy
In this case, we say that the user is browsing or navigating the
documents of the collection
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 13
The User’s Task
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 14
Information × Data Retrieval
Data retrieval: the task of determining which documents
of a collection contain the keywords in the user query
Data retrieval system
Ex: relational databases
Deals with data that has a well defined structure and semantics
A single erroneous object among a thousand retrieved objects
means total failure
Data retrieval does not solve the problem of retrieving
information about a subject or topic
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 15
The IR System
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 16
Architecture of the IR System
High level software architecture of an IR system
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 17
Retrieval and Ranking Processes
The processes of indexing, retrieval, and ranking
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 18
The Web
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 19
A Brief History
At the end of World War II, Vannevar Bush looked for
applications of new technologies to peace times
Bush first produced a report entitled Science, The
Endless Frontier
This report directly influenced the creation of the National
Science Foundation
Following, he wrote As We May Think, a remarkable
paper which discussed new hardware and software
gadgets
In Bush’s words:
Whole new forms of encyclopedias will appear, ready-made with a
mesh of associative trails running through them, ready to be dropped
into the memex and there amplified
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 20
A Brief History
As We May Think influenced people like Douglas
Engelbart, who invented the computer mouse and
introduced the concept of hyperlinked texts
Ted Nelson, working in his Project Xanadu, pushed the
concept further and coined the term hypertext
A hypertext allows the reader to jump from one
electronic document to another, which was one
important property regarding the problem that Tim
Berners-Lee faced in 1989
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 21
A Brief History
At the time, Berners-Lee worked in Geneva at the
CERN—Conseil Européen pour la Recherche Nucléaire
There, researchers who wanted to share documentation
with others had to reformat their documents to make
them compatible with an internal publishing system
Berners-Lee reasoned that it would be nice if the
solution of sharing documents were decentralized
He saw that a networked hypertext would be a good
solution and started working on its implementation
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 22
A Brief History
In 1990, Berners-Lee
Wrote the HTTP protocol
Defined the HTML language
Wrote the first browser, which he called World Wide Web
Wrote the first Web server
In 1991, he made his browser and server software
available in the Internet
The Web was born!
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 23
The e-Publishing Era
Since its inception, the Web became a huge success
Well over 20 billion pages are now available and accessible in the
Web
More than one fourth of humanity now access the Web on a
regular basis
Why is the Web such a success? What is the single
most important characteristic of the Web that makes it
so revolutionary?
In search for an answer, let us dwell into the life of a
writer who lived at the end of the 18th Century
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 24
The e-Publishing Era
She finished the first draft of her novel in 1796
The first attempt of publication was refused without a reading
The novel was only published 15 years later!
She got a flat fee of $110, which meant that she was not paid
anything for the many subsequent editions
Further, her authorship was anonymized under the reference “By
a Lady”
We are talking of ...
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 25
The e-Publishing Era
Pride and Prejudice is the second or third best loved
novel in the UK ever, after The Lord of the Rings and
Harry Potter
It has been the subject of six TV series and five film
versions
The last of these, starring Keira Knightley and Matthew
Macfadyen, grossed over 100 million dollars
Jane Austen published anonymously her entire life
Throughout the 20th century, her novels have never
been out of print
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 26
The e-Publishing Era
Jane Austen was discriminated because there was no
freedom to publish in the beginning of the 19th century
The Web, unleashed by the inventiveness of Tim
Berners-Lee, changed this once and for all
It did so by universalizing freedom to publish
The Web moved mankind into a new era,
into a new time, into The e-Publishing
Era
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 27
How the Web Changed Search
Web search is today the most prominent application of
IR and its techniques—the ranking and indexing
components of any search engine are fundamentally IR
pieces of technology
The first major impact of the Web on search is related
to the characteristics of the document collection itself
The Web is composed of pages distributed over millions of sites
and connected through hyperlinks
This requires collecting all documents and storing copies of them
in a central repository, prior to indexing
This new phase in the IR process, introduced by the Web, is
called crawling
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 28
How the Web Changed Search
The second major impact of the Web on search is
related to:
The size of the collection
The volume of user queries submitted on a daily basis
As a consequence, performance and scalability have become
critical characteristics of the IR system
The third major impact: in a very large collection,
predicting relevance is much harder than before
Fortunately, the Web also includes new sources of evidence
Ex: hyperlinks and user clicks in documents in the answer set
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 29
How the Web Changed Search
The fourth major impact derives from the fact that the
Web is also a medium to do business
Search problem has been extended beyond the seeking of text
information to also encompass other user needs
Ex: the price of a book, the phone number of a hotel, the link for
downloading a software
The fifth major impact of the Web on search is Web
spam
Web spam: abusive availability of commercial information
disguised in the form of informational content
This difficulty is so large that today we talk of Adversarial Web
Retrieval
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 30
Practical Issues in the Web
Security
Commercial transations over the Internet are not yet a completely
safe procedure
Privacy
Frequently, people are willing to exchange information as long as
it does not become public
Copyright and patent rights
It is far from clear how the wide spread of data on the Web affects
copyright and patent laws in the various countries
Scanning, optical character recognition (OCR), and
cross-language retrieval
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 31
Organization of the Book
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 32
Focus of the Book
The book presents an overall view of research in IR
from a computer scientist’s perspective
This means that the main focus of the book is on computer
algorithms and techniques used in IR systems
A rather distinct viewpoint is taken by librarians and
information science researchers
In this viewpoint, the focus is on trying to understand how people
interpret and use information
This human-centered viewpoint is discussed in the user
interfaces chapter and in the last two chapters of the
book
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 33
Book Contents
Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 34

More Related Content

Similar to slides_chap01.pdf

INFORMATION MARKETING
INFORMATION MARKETINGINFORMATION MARKETING
INFORMATION MARKETINGharshaec
 
INFORMATION SCIENCE
INFORMATION SCIENCEINFORMATION SCIENCE
INFORMATION SCIENCEharshaec
 
Week14-Multimedia Information Retrieval.pptx
Week14-Multimedia Information Retrieval.pptxWeek14-Multimedia Information Retrieval.pptx
Week14-Multimedia Information Retrieval.pptxHasanulFahmi2
 
Information Storage and Retrieval : A Case Study
Information Storage and Retrieval : A Case StudyInformation Storage and Retrieval : A Case Study
Information Storage and Retrieval : A Case StudyBhojaraju Gunjal
 
Bioinformatioc: Information Retrieval
Bioinformatioc: Information RetrievalBioinformatioc: Information Retrieval
Bioinformatioc: Information RetrievalDr. Rupak Chakravarty
 
Digital library-overview
Digital library-overviewDigital library-overview
Digital library-overviewAnkit Dubey
 
Need of Digital Libraries in Education.
Need of Digital Libraries in Education.Need of Digital Libraries in Education.
Need of Digital Libraries in Education.inventionjournals
 
A presentation on Applications of ICT in Research.pptx
A presentation on Applications of ICT in Research.pptxA presentation on Applications of ICT in Research.pptx
A presentation on Applications of ICT in Research.pptxROHITSHARMA779690
 
Data Integration Lecture
Data Integration LectureData Integration Lecture
Data Integration LectureSUNY Oneonta
 
History of Big Data
History of Big DataHistory of Big Data
History of Big DataHEXANIKA
 
RELATIONSHIP OF LIBRARY SCIENCE WITH ‎INFORMATION SCIENCE
RELATIONSHIP OF LIBRARY SCIENCE WITH ‎INFORMATION SCIENCERELATIONSHIP OF LIBRARY SCIENCE WITH ‎INFORMATION SCIENCE
RELATIONSHIP OF LIBRARY SCIENCE WITH ‎INFORMATION SCIENCELibcorpio
 
Big data a rescue plan
Big data a rescue planBig data a rescue plan
Big data a rescue planTapasya123
 
Information storage and retrieval PPT.pdf
Information storage and retrieval PPT.pdfInformation storage and retrieval PPT.pdf
Information storage and retrieval PPT.pdfSURAJDHIKAR1
 

Similar to slides_chap01.pdf (20)

INFORMATION MARKETING
INFORMATION MARKETINGINFORMATION MARKETING
INFORMATION MARKETING
 
INFORMATION SCIENCE
INFORMATION SCIENCEINFORMATION SCIENCE
INFORMATION SCIENCE
 
H0362049060
H0362049060H0362049060
H0362049060
 
Week14-Multimedia Information Retrieval.pptx
Week14-Multimedia Information Retrieval.pptxWeek14-Multimedia Information Retrieval.pptx
Week14-Multimedia Information Retrieval.pptx
 
Information Storage and Retrieval : A Case Study
Information Storage and Retrieval : A Case StudyInformation Storage and Retrieval : A Case Study
Information Storage and Retrieval : A Case Study
 
Towards Knowledge-Enabled Society
Towards Knowledge-Enabled SocietyTowards Knowledge-Enabled Society
Towards Knowledge-Enabled Society
 
Bioinformatioc: Information Retrieval
Bioinformatioc: Information RetrievalBioinformatioc: Information Retrieval
Bioinformatioc: Information Retrieval
 
IRT Unit_I.pptx
IRT Unit_I.pptxIRT Unit_I.pptx
IRT Unit_I.pptx
 
Digital library-overview
Digital library-overviewDigital library-overview
Digital library-overview
 
Need of Digital Libraries in Education.
Need of Digital Libraries in Education.Need of Digital Libraries in Education.
Need of Digital Libraries in Education.
 
A presentation on Applications of ICT in Research.pptx
A presentation on Applications of ICT in Research.pptxA presentation on Applications of ICT in Research.pptx
A presentation on Applications of ICT in Research.pptx
 
Data Integration Lecture
Data Integration LectureData Integration Lecture
Data Integration Lecture
 
History of Big Data
History of Big DataHistory of Big Data
History of Big Data
 
RELATIONSHIP OF LIBRARY SCIENCE WITH ‎INFORMATION SCIENCE
RELATIONSHIP OF LIBRARY SCIENCE WITH ‎INFORMATION SCIENCERELATIONSHIP OF LIBRARY SCIENCE WITH ‎INFORMATION SCIENCE
RELATIONSHIP OF LIBRARY SCIENCE WITH ‎INFORMATION SCIENCE
 
Big data a rescue plan
Big data a rescue planBig data a rescue plan
Big data a rescue plan
 
Big Data: A Rescue Plan
Big Data: A Rescue PlanBig Data: A Rescue Plan
Big Data: A Rescue Plan
 
Mam assign
Mam assignMam assign
Mam assign
 
ICT Today
ICT TodayICT Today
ICT Today
 
lecture-1-1487765601.pptx
lecture-1-1487765601.pptxlecture-1-1487765601.pptx
lecture-1-1487765601.pptx
 
Information storage and retrieval PPT.pdf
Information storage and retrieval PPT.pdfInformation storage and retrieval PPT.pdf
Information storage and retrieval PPT.pdf
 

More from lekhacce

Introduction to HTML language Web design.pptx
Introduction to HTML language Web design.pptxIntroduction to HTML language Web design.pptx
Introduction to HTML language Web design.pptxlekhacce
 
HTML_TABLES,FORMS,FRAME markup lang.pptx
HTML_TABLES,FORMS,FRAME markup lang.pptxHTML_TABLES,FORMS,FRAME markup lang.pptx
HTML_TABLES,FORMS,FRAME markup lang.pptxlekhacce
 
javascript client side scripting la.pptx
javascript client side scripting la.pptxjavascript client side scripting la.pptx
javascript client side scripting la.pptxlekhacce
 
matlab-130408153714-phpapp02_lab123.ppsx
matlab-130408153714-phpapp02_lab123.ppsxmatlab-130408153714-phpapp02_lab123.ppsx
matlab-130408153714-phpapp02_lab123.ppsxlekhacce
 
webdevelopment_6132030-lva1-app6891.pptx
webdevelopment_6132030-lva1-app6891.pptxwebdevelopment_6132030-lva1-app6891.pptx
webdevelopment_6132030-lva1-app6891.pptxlekhacce
 
1_chapter one Java content materials.ppt
1_chapter one Java content materials.ppt1_chapter one Java content materials.ppt
1_chapter one Java content materials.pptlekhacce
 
Information RetrievalsT_I_materials.pptx
Information RetrievalsT_I_materials.pptxInformation RetrievalsT_I_materials.pptx
Information RetrievalsT_I_materials.pptxlekhacce
 
Information_Retrievals Unit_3_chap09.pdf
Information_Retrievals Unit_3_chap09.pdfInformation_Retrievals Unit_3_chap09.pdf
Information_Retrievals Unit_3_chap09.pdflekhacce
 
slides_chap02.pdf
slides_chap02.pdfslides_chap02.pdf
slides_chap02.pdflekhacce
 

More from lekhacce (10)

Introduction to HTML language Web design.pptx
Introduction to HTML language Web design.pptxIntroduction to HTML language Web design.pptx
Introduction to HTML language Web design.pptx
 
HTML_TABLES,FORMS,FRAME markup lang.pptx
HTML_TABLES,FORMS,FRAME markup lang.pptxHTML_TABLES,FORMS,FRAME markup lang.pptx
HTML_TABLES,FORMS,FRAME markup lang.pptx
 
javascript client side scripting la.pptx
javascript client side scripting la.pptxjavascript client side scripting la.pptx
javascript client side scripting la.pptx
 
matlab-130408153714-phpapp02_lab123.ppsx
matlab-130408153714-phpapp02_lab123.ppsxmatlab-130408153714-phpapp02_lab123.ppsx
matlab-130408153714-phpapp02_lab123.ppsx
 
webdevelopment_6132030-lva1-app6891.pptx
webdevelopment_6132030-lva1-app6891.pptxwebdevelopment_6132030-lva1-app6891.pptx
webdevelopment_6132030-lva1-app6891.pptx
 
1_chapter one Java content materials.ppt
1_chapter one Java content materials.ppt1_chapter one Java content materials.ppt
1_chapter one Java content materials.ppt
 
Information RetrievalsT_I_materials.pptx
Information RetrievalsT_I_materials.pptxInformation RetrievalsT_I_materials.pptx
Information RetrievalsT_I_materials.pptx
 
Information_Retrievals Unit_3_chap09.pdf
Information_Retrievals Unit_3_chap09.pdfInformation_Retrievals Unit_3_chap09.pdf
Information_Retrievals Unit_3_chap09.pdf
 
slides_chap02.pdf
slides_chap02.pdfslides_chap02.pdf
slides_chap02.pdf
 
AES.pptx
AES.pptxAES.pptx
AES.pptx
 

Recently uploaded

Ground Improvement Technique: Earth Reinforcement
Ground Improvement Technique: Earth ReinforcementGround Improvement Technique: Earth Reinforcement
Ground Improvement Technique: Earth ReinforcementDr. Deepak Mudgal
 
Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...ppkakm
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdfKamal Acharya
 
Introduction to Robotics in Mechanical Engineering.pptx
Introduction to Robotics in Mechanical Engineering.pptxIntroduction to Robotics in Mechanical Engineering.pptx
Introduction to Robotics in Mechanical Engineering.pptxhublikarsn
 
Path loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata ModelPath loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata ModelDrAjayKumarYadav4
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdfKamal Acharya
 
Introduction to Artificial Intelligence ( AI)
Introduction to Artificial Intelligence ( AI)Introduction to Artificial Intelligence ( AI)
Introduction to Artificial Intelligence ( AI)ChandrakantDivate1
 
Introduction to Geographic Information Systems
Introduction to Geographic Information SystemsIntroduction to Geographic Information Systems
Introduction to Geographic Information SystemsAnge Felix NSANZIYERA
 
Compressing and Sparsifying LLM in GenAI Applications
Compressing and Sparsifying LLM in GenAI ApplicationsCompressing and Sparsifying LLM in GenAI Applications
Compressing and Sparsifying LLM in GenAI ApplicationsMFatihSIRA
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network DevicesChandrakantDivate1
 
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...HenryBriggs2
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdfKamal Acharya
 
Cybercrimes in the Darknet and Their Detections: A Comprehensive Analysis and...
Cybercrimes in the Darknet and Their Detections: A Comprehensive Analysis and...Cybercrimes in the Darknet and Their Detections: A Comprehensive Analysis and...
Cybercrimes in the Darknet and Their Detections: A Comprehensive Analysis and...dannyijwest
 
8086 Microprocessor Architecture: 16-bit microprocessor
8086 Microprocessor Architecture: 16-bit microprocessor8086 Microprocessor Architecture: 16-bit microprocessor
8086 Microprocessor Architecture: 16-bit microprocessorAshwiniTodkar4
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdfKamal Acharya
 
Convergence of Robotics and Gen AI offers excellent opportunities for Entrepr...
Convergence of Robotics and Gen AI offers excellent opportunities for Entrepr...Convergence of Robotics and Gen AI offers excellent opportunities for Entrepr...
Convergence of Robotics and Gen AI offers excellent opportunities for Entrepr...ssuserdfc773
 
Danikor Product Catalog- Screw Feeder.pdf
Danikor Product Catalog- Screw Feeder.pdfDanikor Product Catalog- Screw Feeder.pdf
Danikor Product Catalog- Screw Feeder.pdfthietkevietthinh
 
Computer Graphics Introduction To Curves
Computer Graphics Introduction To CurvesComputer Graphics Introduction To Curves
Computer Graphics Introduction To CurvesChandrakantDivate1
 
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...ronahami
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayEpec Engineered Technologies
 

Recently uploaded (20)

Ground Improvement Technique: Earth Reinforcement
Ground Improvement Technique: Earth ReinforcementGround Improvement Technique: Earth Reinforcement
Ground Improvement Technique: Earth Reinforcement
 
Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
Introduction to Robotics in Mechanical Engineering.pptx
Introduction to Robotics in Mechanical Engineering.pptxIntroduction to Robotics in Mechanical Engineering.pptx
Introduction to Robotics in Mechanical Engineering.pptx
 
Path loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata ModelPath loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata Model
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
Introduction to Artificial Intelligence ( AI)
Introduction to Artificial Intelligence ( AI)Introduction to Artificial Intelligence ( AI)
Introduction to Artificial Intelligence ( AI)
 
Introduction to Geographic Information Systems
Introduction to Geographic Information SystemsIntroduction to Geographic Information Systems
Introduction to Geographic Information Systems
 
Compressing and Sparsifying LLM in GenAI Applications
Compressing and Sparsifying LLM in GenAI ApplicationsCompressing and Sparsifying LLM in GenAI Applications
Compressing and Sparsifying LLM in GenAI Applications
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
Cybercrimes in the Darknet and Their Detections: A Comprehensive Analysis and...
Cybercrimes in the Darknet and Their Detections: A Comprehensive Analysis and...Cybercrimes in the Darknet and Their Detections: A Comprehensive Analysis and...
Cybercrimes in the Darknet and Their Detections: A Comprehensive Analysis and...
 
8086 Microprocessor Architecture: 16-bit microprocessor
8086 Microprocessor Architecture: 16-bit microprocessor8086 Microprocessor Architecture: 16-bit microprocessor
8086 Microprocessor Architecture: 16-bit microprocessor
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
 
Convergence of Robotics and Gen AI offers excellent opportunities for Entrepr...
Convergence of Robotics and Gen AI offers excellent opportunities for Entrepr...Convergence of Robotics and Gen AI offers excellent opportunities for Entrepr...
Convergence of Robotics and Gen AI offers excellent opportunities for Entrepr...
 
Danikor Product Catalog- Screw Feeder.pdf
Danikor Product Catalog- Screw Feeder.pdfDanikor Product Catalog- Screw Feeder.pdf
Danikor Product Catalog- Screw Feeder.pdf
 
Computer Graphics Introduction To Curves
Computer Graphics Introduction To CurvesComputer Graphics Introduction To Curves
Computer Graphics Introduction To Curves
 
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 

slides_chap01.pdf

  • 1. Modern Information Retrieval Chapter 1 Introduction Information Retrieval The IR Problem The IR System The Web Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 1
  • 2. Information Retrieval (IR) IR deals with the representation, storage, organization of, and access to information items Types of information items: documents, Web pages, online catalogs, structured records, multimedia objects Early goals of the IR area: indexing text and searching for useful documents in a collection Nowadays, research in IR includes: Modeling, Web search, text classification, systems architecture, user interfaces, data visualization, filtering and languages Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 2
  • 3. Early Developments For more than 5,000 years, man has organized information for later retrieval and searching This has been done by compiling, storing, organizing, and indexing papyrus, hieroglyphics, and books For holding the various items, special purpose buildings called libraries, or bibliothekes, are used The oldest known library was created in Elba, in the Fertile Crescent, between 3,000 and 2,500 BC By 300 BC, Ptolemy Soter, a Macedonian general, created the Great Library at Alexandria Nowadays, libraries are everywhere In 2008, more than 2 billion items were checked out from libraries in the US—an increase of 10% over the previous year Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 3
  • 4. Early Developments Since the volume of information in libraries is always growing, it is necessary to build specialized data structures for fast search — the indexes For centuries indexes have been created manually as sets of categories, with labels associated with each category The advent of modern computers has allowed the construction of large indexes automatically Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 4
  • 5. Early Developments in IR During the 50’s, research efforts in IR were initiated by pioneers such as Hans Peter Luhn, Eugene Garfield, Philip Bagley, and Calvin Moores, who allegedly coined the term Information Retrieval In 1962, Cyril Cleverdon published the Cranfield studies on retrieval evaluation In 1963, Joseph Becker and Robert Hayes published the first book on IR In the late 60’s, key research conducted by Karen Sparck Jones and Gerard Salton, among others, led to the definition of the TF-IDF term weighting scheme Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 5
  • 6. Early Developments in IR In 1971, Jardine and van Rijsbergen articulated the cluster hypothesis In 1978, the first ACM SIGIR Internation Conference on Information Retrieval was held in Rochester In 1979, van Rijsbergen published a classic book entitled Information Retrieval, which focused on the Probabilistic Model In 1983, Salton and McGill published a classic book entitled Introduction to Modern Information Retrieval, which focused on the Vector Model Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 6
  • 7. Libraries and Digital Libraries Libraries were among the first institutions to adopt IR systems for retrieving information Initially, such systems consisted of an automation of existing processes such as card catalogs searching Increased search functionality was then added Ex: subject headings, keywords, query operators Nowadays, the focus has been on improved graphical interfaces, electronic forms, hypertext features Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 7
  • 8. IR at the Center of the Stage Until recently, IR was an area of interest restricted mainly to librarians and information experts A single fact changed these perceptions—the introduction of the Web, which has become the largest repository of knowledge in human history Due to its enormous size, finding useful information on the Web usually requires running a search And searching on the Web is all about IR and its technologies Thus, almost overnight, IR has gained a place with other technologies at the center of the stage Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 8
  • 9. The IR Problem Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 9
  • 10. The IR Problem Users of modern IR systems, such as search engine users, have information needs of varying complexity An example of complex information need is as follows: Find all documents that address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK) Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 10
  • 11. The IR Problem This full description of the user information need is not necessarily a good query to be submitted to the IR system Instead, the user might want to first translate this information need into a query This translation process yields a set of keywords, or index terms, which summarize the user information need Given the user query, the key goal of the IR system is to retrieve information that is useful or relevant to the user Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 11
  • 12. The IR Problem That is, the IR system must rank the information items according to a degree of relevance to the user query The IR Problem The key goal of an IR system is to retrieve all the items that are relevant to a user query, while retrieving as few nonrelevant items as possible The notion of relevance is of central importance in IR Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 12
  • 13. The User’s Task Consider a user who seeks information on a topic of their interest This user first translates their information need into a query, which requires specifying the words that compose the query In this case, we say that the user is searching or querying for information of their interest Consider now a user who has an interest that is either poorly defined or inherently broad For instance, the user has an interest in car racing and wants to browse documents on Formula 1 and Formula Indy In this case, we say that the user is browsing or navigating the documents of the collection Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 13
  • 14. The User’s Task Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 14
  • 15. Information × Data Retrieval Data retrieval: the task of determining which documents of a collection contain the keywords in the user query Data retrieval system Ex: relational databases Deals with data that has a well defined structure and semantics A single erroneous object among a thousand retrieved objects means total failure Data retrieval does not solve the problem of retrieving information about a subject or topic Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 15
  • 16. The IR System Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 16
  • 17. Architecture of the IR System High level software architecture of an IR system Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 17
  • 18. Retrieval and Ranking Processes The processes of indexing, retrieval, and ranking Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 18
  • 19. The Web Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 19
  • 20. A Brief History At the end of World War II, Vannevar Bush looked for applications of new technologies to peace times Bush first produced a report entitled Science, The Endless Frontier This report directly influenced the creation of the National Science Foundation Following, he wrote As We May Think, a remarkable paper which discussed new hardware and software gadgets In Bush’s words: Whole new forms of encyclopedias will appear, ready-made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 20
  • 21. A Brief History As We May Think influenced people like Douglas Engelbart, who invented the computer mouse and introduced the concept of hyperlinked texts Ted Nelson, working in his Project Xanadu, pushed the concept further and coined the term hypertext A hypertext allows the reader to jump from one electronic document to another, which was one important property regarding the problem that Tim Berners-Lee faced in 1989 Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 21
  • 22. A Brief History At the time, Berners-Lee worked in Geneva at the CERN—Conseil Européen pour la Recherche Nucléaire There, researchers who wanted to share documentation with others had to reformat their documents to make them compatible with an internal publishing system Berners-Lee reasoned that it would be nice if the solution of sharing documents were decentralized He saw that a networked hypertext would be a good solution and started working on its implementation Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 22
  • 23. A Brief History In 1990, Berners-Lee Wrote the HTTP protocol Defined the HTML language Wrote the first browser, which he called World Wide Web Wrote the first Web server In 1991, he made his browser and server software available in the Internet The Web was born! Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 23
  • 24. The e-Publishing Era Since its inception, the Web became a huge success Well over 20 billion pages are now available and accessible in the Web More than one fourth of humanity now access the Web on a regular basis Why is the Web such a success? What is the single most important characteristic of the Web that makes it so revolutionary? In search for an answer, let us dwell into the life of a writer who lived at the end of the 18th Century Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 24
  • 25. The e-Publishing Era She finished the first draft of her novel in 1796 The first attempt of publication was refused without a reading The novel was only published 15 years later! She got a flat fee of $110, which meant that she was not paid anything for the many subsequent editions Further, her authorship was anonymized under the reference “By a Lady” We are talking of ... Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 25
  • 26. The e-Publishing Era Pride and Prejudice is the second or third best loved novel in the UK ever, after The Lord of the Rings and Harry Potter It has been the subject of six TV series and five film versions The last of these, starring Keira Knightley and Matthew Macfadyen, grossed over 100 million dollars Jane Austen published anonymously her entire life Throughout the 20th century, her novels have never been out of print Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 26
  • 27. The e-Publishing Era Jane Austen was discriminated because there was no freedom to publish in the beginning of the 19th century The Web, unleashed by the inventiveness of Tim Berners-Lee, changed this once and for all It did so by universalizing freedom to publish The Web moved mankind into a new era, into a new time, into The e-Publishing Era Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 27
  • 28. How the Web Changed Search Web search is today the most prominent application of IR and its techniques—the ranking and indexing components of any search engine are fundamentally IR pieces of technology The first major impact of the Web on search is related to the characteristics of the document collection itself The Web is composed of pages distributed over millions of sites and connected through hyperlinks This requires collecting all documents and storing copies of them in a central repository, prior to indexing This new phase in the IR process, introduced by the Web, is called crawling Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 28
  • 29. How the Web Changed Search The second major impact of the Web on search is related to: The size of the collection The volume of user queries submitted on a daily basis As a consequence, performance and scalability have become critical characteristics of the IR system The third major impact: in a very large collection, predicting relevance is much harder than before Fortunately, the Web also includes new sources of evidence Ex: hyperlinks and user clicks in documents in the answer set Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 29
  • 30. How the Web Changed Search The fourth major impact derives from the fact that the Web is also a medium to do business Search problem has been extended beyond the seeking of text information to also encompass other user needs Ex: the price of a book, the phone number of a hotel, the link for downloading a software The fifth major impact of the Web on search is Web spam Web spam: abusive availability of commercial information disguised in the form of informational content This difficulty is so large that today we talk of Adversarial Web Retrieval Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 30
  • 31. Practical Issues in the Web Security Commercial transations over the Internet are not yet a completely safe procedure Privacy Frequently, people are willing to exchange information as long as it does not become public Copyright and patent rights It is far from clear how the wide spread of data on the Web affects copyright and patent laws in the various countries Scanning, optical character recognition (OCR), and cross-language retrieval Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 31
  • 32. Organization of the Book Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 32
  • 33. Focus of the Book The book presents an overall view of research in IR from a computer scientist’s perspective This means that the main focus of the book is on computer algorithms and techniques used in IR systems A rather distinct viewpoint is taken by librarians and information science researchers In this viewpoint, the focus is on trying to understand how people interpret and use information This human-centered viewpoint is discussed in the user interfaces chapter and in the last two chapters of the book Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 33
  • 34. Book Contents Chap 01: Introduction, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 34