IRT # 3 IRS Architecture.pdf

3/26/2019
Dr. Muzammil Khan
Information Retrieval Techniques
Assistant Professor
Department of Computer & Software Technology
Office # 4
Lecture 3
IRS Architecture

IRS Architecture
 Typically, an IRS takes documents, indexes them, then
accepts queries, matches query terms against the index, and presents
results
IRS Architecture Requirements
 The IRS should provide
 Scalability
 Must handle large document collections
 Index Efficiency
 Must build indexes in a reasonable amount of time
 Query Efficiency
 Queries must run fast
 Query Effectiveness
 Result set must be relevant

IRS Contents
 The design of a search application depends on its content
 The contents may be
 Social Network
 Image Library
 A mixed media database
 Etc…
 Basic Media Types
 Text, Graphics, Images, Audio, Video
 Most content is handled via its textual content
IRS Contents (Cont…)
 Document
 Articles
 Technical reports
 White papers, spreadsheets,
 Presentations,
 Marketing materials,
 Online forms
 Etc…

 Web Page
 Includes both
1. Visible Content
 Displayed information
2. Off-the-page HTML tags
 For example
 Title,
 Description, and
 Keywords
 Book
 Content plus metadata
 Full-text content, and
 Behavioral metadata

 Object
 In the absence of full text
 Metadata is often indexed
 Complexity of the information retrieval challenge
 Increases exponentially with linear increases in volume
 The most dramatic way
 To improve performance is to search less content
 Shrink the search space by removing the ROT
 Redundant Outdated Trivial information
 Metadata provides filters
 So, users can slice up the search space
To Search better, Search less
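The idea of slicing the search space with metadata filters can be sketched as follows. This is a minimal illustration only: the `Record` class, its `year` and `media` fields, and the sample records are hypothetical and not part of the lecture.

```python
from dataclasses import dataclass


@dataclass
class Record:
    doc_id: int
    text: str
    year: int    # hypothetical metadata field
    media: str   # hypothetical metadata field


def filter_then_search(records, term, *, media=None, min_year=None):
    """Apply metadata filters first, then scan only the remaining records."""
    candidates = [
        r for r in records
        if (media is None or r.media == media)
        and (min_year is None or r.year >= min_year)
    ]
    hits = [r.doc_id for r in candidates if term.lower() in r.text.lower().split()]
    # Return the hits and how many records actually had to be scanned.
    return hits, len(candidates)


records = [
    Record(1, "Islamabad is the capital of Pakistan", 2019, "text"),
    Record(2, "Old capital city guide", 1995, "text"),
    Record(3, "Capital skyline photo", 2018, "image"),
]

hits, searched = filter_then_search(records, "capital", media="text", min_year=2000)
print(hits, searched)
```

With both filters applied, only one of the three records is scanned, which is the "search less" effect in miniature.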

Textual Document
 A textual document is
 A digital object consisting of a sequence of words and other
symbols
 For example, punctuation marks, etc.
 Varying units (that can be considered)
 Strings and segments
 Tokens and words
 Phrases and entities
 Sentences and concepts
 For retrieval
 The individual words and other groups of symbols used for
retrieval are known as terms
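Extracting terms from a textual document can be sketched with a very simple tokenizer. The rule used here (lowercase the text, keep runs of letters and digits, drop punctuation) is an assumption made for illustration; real systems apply more elaborate text operations.

```python
import re


def tokenize(text):
    """Split text into lowercase terms; punctuation and other symbols are dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


terms = tokenize("The capital of Pakistan is called Islamabad.")
print(terms)
```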
Textual Document (Cont…)
 A Textual document can be
 Free Text
 Unstructured text
 Continuous sequence of terms
 Fielded Text
 Structured text
 Text is broken into sections that are distinguished by tags
or other markup,
 e.g., a library catalog

Understand the Document Structure
 Identify document structure
 Titles, sections, paragraphs, image captions, sentences, etc.
 This is the physical structure; there is also a conceptual structure
 Content and Attributes
 Content = words
 Attributes = author, title, date, etc.
Text Representation
 Means
 The methods of representing unstructured text
 There are different stages involved in creating a
representation of text
 Text Operations
 To translate natural language documents into machine
usable form
 So that machine operations can be performed on them
 Two aspects of Representation
1. Description
2. Discrimination

Text Representation (Cont…)
 Description
 What is the content of a document?
 Important so that
 We can recognize which documents might be relevant to the user
 We can show the user all relevant documents
 Discrimination
 How do I distinguish this document from other documents?
 Important so that
 We don’t retrieve trash
 We show the user only relevant documents
 Description & Discrimination act against each other
 Good Representation is a balance
Indexing
 In general, Indexing is
 The organization of data according to a specific schema/plan
 The way to get an unordered table into an order that will
maximize the query’s efficiency while searching
 Search Engine Indexing
 Collects, parses, and stores data to facilitate fast and
accurate information retrieval
 Index design incorporates interdisciplinary concepts from
 Linguistics,
 Cognitive psychology,
 Mathematics,
 Informatics, and Computer Science

Indexing (Cont…)
 Index is Data Structure for Searching
 Selection of Data for Indexing ?
 Index Term
 A word or phrase that denotes (describes) a concept &
connotes (implies) a class
 Requirements
 Represent documents appropriately
 Enable efficient and effective search
 Limit storage
 These should be traded off against speed
Index & Searching
 Indexing and searching are inextricably connected
 Documents or objects are indexed in order to make
them searchable
 Content cannot be searched
 If it is not first indexed in some manner or other
 Many ways to do indexing
 To index one needs an indexing language (> 1)
 Even taking every word in a document is an
indexing language
 Knowing Searching is knowing Indexing

Logical View of Document
 Way ahead to Indexing
Indexing Example
 UIN (User Information Need) is
 What is the capital of Pakistan?
 Doc. 1: The capital of Pakistan is called Islamabad
 Doc. 2: Islamabad is the capital of Pakistan
 Doc. 3: The capitals of Pakistan and India are Islamabad
and Delhi, respectively
 Simple Query “Capital of Pakistan”
 Boolean Query “Capital AND Pakistan”
 Delivers Doc. 1 and Doc. 2 as results (Doc. 3 contains “capitals”, not “capital”)
 Naive approach
 Scanning and Text Matching
 Can we do this more efficiently?
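The naive approach above (scan every document and match the text) can be sketched as follows. The function name `scan_and` and the exact-word matching rule are assumptions for illustration; no stemming is applied, so “capitals” does not match “capital”.

```python
import re

docs = {
    1: "The capital of Pakistan is called Islamabad",
    2: "Islamabad is the capital of Pakistan",
    3: "The capitals of Pakistan and India are Islamabad and Delhi, respectively",
}


def scan_and(docs, *query_terms):
    """Boolean AND by brute force: re-read every document for every query."""
    wanted = {t.lower() for t in query_terms}
    results = []
    for doc_id, text in docs.items():
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        if wanted <= words:  # every query term occurs in this document
            results.append(doc_id)
    return results


print(scan_and(docs, "Capital", "Pakistan"))  # [1, 2]
```

Every query touches the full text of every document, so the cost grows with the size of the whole collection. That is the inefficiency an index is meant to remove.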

Indexing Example (Cont…)
 Build a matrix / table / index with
 Columns = Documents
 Rows = All appearing words (alphabetically sorted)
 In Boolean Matching
 1 = Word appears in doc
 0 = Word does not appear
 For the documents
 Doc. 1: The capital of Pakistan is called Islamabad
 Doc. 2: Islamabad is the capital of Pakistan
 Doc. 3: The capitals of Pakistan and India are called
Islamabad and Delhi, respectively
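The matrix described above can be sketched for these three documents as follows. The tokenization rule (lowercase, letters and digits only) and the helper names are assumptions made for illustration.

```python
import re

docs = {
    1: "The capital of Pakistan is called Islamabad",
    2: "Islamabad is the capital of Pakistan",
    3: "The capitals of Pakistan and India are called Islamabad and Delhi, respectively",
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


# One set of distinct terms per document.
token_sets = {doc_id: set(tokenize(text)) for doc_id, text in docs.items()}

# Rows = all appearing words (alphabetically sorted), columns = documents.
vocab = sorted(set().union(*token_sets.values()))
doc_ids = sorted(docs)
matrix = {t: [int(t in token_sets[d]) for d in doc_ids] for t in vocab}


def boolean_and(matrix, a, b):
    """Bitwise AND of two term rows; a 1 marks a matching document."""
    return [x & y for x, y in zip(matrix[a], matrix[b])]


for term in vocab:
    print(f"{term:<12}", *matrix[term])

print(boolean_and(matrix, "capital", "pakistan"))  # [1, 1, 0]
```

The result row `[1, 1, 0]` corresponds to Doc. 1 and Doc. 2, matching the Boolean query result stated earlier.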

Term-Document Incidence Matrix
 Search
 Very easy, but
 Storage
 Not very efficient
 For example,
 1,000,000 docs × 1,000 terms = matrix with 1,000,000,000
cells
 The Good news
 This matrix is very sparse (lots of 0’s, only a few 1’s)
 Idea: Just store the ‘hits’ (term incidences)
 So, the data structure used is the Inverted Index
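The storage arithmetic above can be checked with a short calculation. The figure of 50 distinct terms per document is a hypothetical value chosen only to illustrate sparsity; it does not come from the lecture.

```python
num_docs = 1_000_000
num_terms = 1_000

# Full incidence matrix: one cell per (term, document) pair.
cells = num_docs * num_terms

# Hypothetical assumption: each document contains 50 distinct terms on average.
avg_distinct_terms_per_doc = 50
hits = num_docs * avg_distinct_terms_per_doc

print(f"matrix cells : {cells:,}")
print(f"stored hits  : {hits:,}")
print(f"cells that are 1: {hits / cells:.0%}")
```

Under that assumption only the 1's are stored, so the structure holds 50 million entries instead of one billion cells.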

Inverted Index (Interchanging Columns & Rows)
Creating Inverted Index
 For each term “t”: store all documents that contain “t”
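A minimal sketch of this rule, using the three example documents from the indexing example, might look as follows. The function names and the tokenization rule are assumptions for illustration, not the lecture's own implementation.

```python
import re
from collections import defaultdict

docs = {
    1: "The capital of Pakistan is called Islamabad",
    2: "Islamabad is the capital of Pakistan",
    3: "The capitals of Pakistan and India are Islamabad and Delhi, respectively",
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


def boolean_and(index, *terms):
    """Intersect the document lists of all query terms."""
    if not terms:
        return []
    doc_lists = [set(index.get(t.lower(), [])) for t in terms]
    return sorted(set.intersection(*doc_lists))


index = build_inverted_index(docs)
print(index["capital"])                           # [1, 2]
print(boolean_and(index, "capital", "pakistan"))  # [1, 2]
```

Only the documents listed for the query terms are touched at query time; documents that contain none of the terms are never read.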

Creating Index
 Documents are parsed to extract tokens
 Each document is assigned a unique document id
[Figure: tokens extracted from Doc 1 and Doc 2]
Sorting…
[Figure: sorted entries for Doc 1 and Doc 2]

Term Frequency
 Term Frequency means
 The number of occurrences of a term in a document
 Multiple term entries for a single document are merged
 Within-document term frequency information is
compiled
 A numeric value is generated
 For example,
 Compare ‘the’ in both files
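Merging repeated term entries into a within-document frequency can be sketched as follows. The two short documents are hypothetical stand-ins, since the files shown on the original slide are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical sample documents (not the files from the slide).
docs = {
    1: "The cat sat on the mat",
    2: "The dog barked",
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


# For each document, merge repeated terms into a single entry with a count.
tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

print(tf[1]["the"], tf[2]["the"])  # 2 1
```

The term ‘the’ gets one entry per document, with a frequency of 2 in the first document and 1 in the second.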
TF (Cont…)

Inverted Index
 An inverted index is split into two files
 Dictionary
 A vocabulary file containing all the terms from the
documents
 Postings File
 A file containing the document entries for each term, with
frequencies
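The split into a dictionary and a postings file can be sketched with two in-memory structures. The layout chosen here (each dictionary entry stores a document count and an offset into one flat postings list) is an assumption for illustration; the lecture does not specify a concrete file layout.

```python
import re
from collections import Counter, defaultdict

docs = {
    1: "The capital of Pakistan is called Islamabad",
    2: "Islamabad is the capital of Pakistan",
    3: "The capitals of Pakistan and India are Islamabad and Delhi, respectively",
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_dictionary_and_postings(docs):
    per_term = defaultdict(Counter)  # term -> {doc_id: term frequency}
    for doc_id, text in docs.items():
        for term in tokenize(text):
            per_term[term][doc_id] += 1

    dictionary = {}  # term -> (number of documents, offset into postings)
    postings = []    # flat list of (doc_id, term frequency) entries
    for term in sorted(per_term):
        entries = sorted(per_term[term].items())
        dictionary[term] = (len(entries), len(postings))
        postings.extend(entries)
    return dictionary, postings


def lookup(dictionary, postings, term):
    """Find the term in the dictionary, then read its slice of the postings."""
    if term not in dictionary:
        return []
    count, offset = dictionary[term]
    return postings[offset:offset + count]


dictionary, postings = build_dictionary_and_postings(docs)
print(lookup(dictionary, postings, "capital"))  # [(1, 1), (2, 1)]
print(lookup(dictionary, postings, "and"))      # [(3, 2)]
```

Keeping the dictionary small and separate means a term can be located first, and only its own postings entries need to be read afterwards.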
How can we improve?
