3/26/2019
Dr. Muzammil Khan
Information Retrieval Techniques
Assistant Professor
Department of Computer & Software Technology
Office # 4
Lecture 3
IRS Architecture
IRS Architecture
 Typically, an IRS takes documents, indexes them, then
accepts queries, matches query terms against the index, and presents
results
IRS Architecture Requirements
 An IRS should provide
 Scalability
 Must handle large document collections
 Index Efficiency
 Must build indexes in a reasonable amount of time
 Query Efficiency
 Queries must run fast
 Query Effectiveness
 Result set must be relevant
IRS Contents
 The design of a search application depends on its content
 The contents may be
 Social Network
 Image Library
 A mixed media database
 Etc…
 Basic Media Types
 Text, Graphics, Images, Audio, Video
 Most content is handled via its textual content
IRS Contents(Cont…)
 Document
 Articles
 Technical reports
 White papers, spreadsheets,
 Presentations,
 Marketing materials,
 Online forms
 Etc…
IRS Contents(Cont…)
 Web Page
 Includes both
1. Visible Content
 Displayed information
2. Off-the-page HTML tags
 For example
 Title,
 Description, and
 Keywords
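The two kinds of web page content listed above can be separated with a small parser. The following is a minimal sketch using Python's standard `html.parser`; the class name `PageFields`, the helper `extract_fields`, and the sample page are assumptions made for illustration and are not part of the lecture.

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Collect the <title> text and the description/keywords meta tags."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "description": "", "keywords": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name in ("description", "keywords"):
            # Off-the-page tags: not displayed, but useful for indexing.
            self.fields[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] += data


# Hypothetical page, invented for illustration.
SAMPLE_PAGE = """<html><head>
<title>IRS Architecture</title>
<meta name="description" content="Lecture notes on indexing">
<meta name="keywords" content="indexing, retrieval">
</head><body><p>Visible content goes here.</p></body></html>"""


def extract_fields(html):
    """Return a dict with the title, description, and keywords of a page."""
    parser = PageFields()
    parser.feed(html)
    return parser.fields
```

Calling `extract_fields(SAMPLE_PAGE)` returns the three fields as a dictionary, which could then be indexed alongside the visible text.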
IRS Contents(Cont…)
 Book
 Content plus metadata
 Full-text content, and
 Behavioral metadata
IRS Contents(Cont…)
 Object
 In the absence of full text
 Metadata is often indexed
IRS Contents(Cont…)
 Complexity of the information retrieval challenge
 Increases exponentially with linear increases in volume
 The most dramatic way
 To improve performance is to search less content
 Shrink the search space by removing the ROT
 Redundant, Outdated, Trivial information
 Metadata provides filters
 So, users can slice up the search space
To Search better, Search less
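The idea of filtering on metadata before searching can be sketched as follows. The collection `DOCS`, its field names, and the function `search` are hypothetical and chosen only for illustration; the point is that fewer documents are scanned once a filter is applied.

```python
# Hypothetical mini-collection; field names and values are invented.
DOCS = [
    {"id": 1, "year": 2019, "type": "report", "text": "indexing large collections"},
    {"id": 2, "year": 2008, "type": "report", "text": "indexing with punch cards"},
    {"id": 3, "year": 2019, "type": "slides", "text": "query efficiency basics"},
]


def search(docs, term, **filters):
    """Filter by metadata first, then scan only the remaining documents.

    Returns the matching ids and the number of documents actually scanned.
    """
    candidates = [
        d for d in docs if all(d.get(key) == value for key, value in filters.items())
    ]
    hits = [d["id"] for d in candidates if term in d["text"].split()]
    return hits, len(candidates)
```

Without filters, `search(DOCS, "indexing")` scans all three documents; with `year=2019` it scans only two.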
Textual Document
 A textual document is
 A digital object consisting of a sequence of words and other
symbols
 For example, punctuation marks, etc.
 Varying units (That can be considered)
 Strings and segments
 Tokens and words
 Phrases and entities
 Sentences and concepts
 For retrieval
 The individual words and other groups of symbols used for
retrieval are known as terms
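Extracting terms from a textual document can be sketched in a few lines. This is only one possible choice: lowercasing and dropping punctuation are assumptions made here, and the function names `tokenize` and `terms` are illustrative.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def terms(text):
    """Return the distinct terms of a text in alphabetical order."""
    return sorted(set(tokenize(text)))
```

For example, `tokenize("Islamabad is the capital of Pakistan")` yields the six lowercase word tokens in document order, while `terms` removes duplicates and sorts them.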
Textual Document (Cont…)
 A Textual document can be
 Free Text
 Unstructured text
 Continuous sequence of terms
 Fielded Text
 Structured text
 Text is broken into sections that are distinguished by tags
or other markup,
 e.g., a library catalog
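The difference between fielded and free text can be illustrated with a small sketch. The catalog records, field names, and both functions below are hypothetical: a fielded search looks inside one named section, while a free-text search treats the whole record as one continuous sequence of terms.

```python
# Hypothetical catalog records; field names and values are invented.
CATALOG = [
    {"title": "Introduction to Indexing", "author": "A. Author", "year": "2001"},
    {"title": "Searching Text", "author": "B. Writer", "year": "2010"},
]


def search_field(records, field, term):
    """Return records whose given field contains the term as a whole word."""
    term = term.lower()
    return [r for r in records if term in r.get(field, "").lower().split()]


def search_free_text(records, term):
    """Ignore the fields and search each record as one sequence of terms."""
    term = term.lower()
    return [r for r in records if term in " ".join(r.values()).lower().split()]
```

Here `search_field(CATALOG, "title", "indexing")` finds the first record, while the same term searched in the `author` field finds nothing.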
Understand the Document Structure
 Identify document structure
 Titles, sections, paragraphs, image captions, sentences, etc.
 This is the physical structure; there is also a conceptual structure
 Content and Attributes
 Content = words
 Attributes = author, title, date, etc.
Text Representation
 Means
 The methods of representing unstructured text
 There are different stages involved in creating a
representation of text
 Text Operations
 To translate natural language documents into machine
usable form
 So that machine operations can be performed
 Two aspects of Representation
1. Description
2. Discrimination
Text Representation (Cont…)
 Description
 What is the content of a document?
 Important so that
 We can recognize which documents might be relevant to the user
 We can show the user all relevant documents
 Discrimination
 How do I distinguish this document from other documents?
 Important so that
 We don’t retrieve trash
 We show the user only relevant documents
 Description & Discrimination act against each other
 Good Representation is a balance
Indexing
 In general, Indexing is
 The organization of data according to a specific schema/plan
 The way to get an unordered table into an order that will
maximize the query’s efficiency while searching
 Search Engine Indexing is
 To collect, parse, and store data to facilitate fast and
accurate information retrieval
 Index design incorporates interdisciplinary concepts from
 Linguistics,
 Cognitive psychology,
 Mathematics,
 Informatics, and Computer Science
Indexing (Cont…)
 An index is a data structure for searching
 Selection of data for indexing?
 Index Term
 A word or phrase that denotes (describes) a concept &
connotes (implies) a class
 Requirements
 Represent documents appropriately
 Enable efficient and effective search
 Limit storage
 These requirements must be traded off against speed
Index & Searching
 Indexing and searching are inextricably connected
 Documents or objects are indexed in order to make
them searchable
 A collection cannot be searched
 If it has not first been indexed in some manner or other
 Many ways to do indexing
 To index one needs an indexing language (> 1)
 Even taking every word in a document is an
indexing language
 Knowing Searching is knowing Indexing
Logical View of Document
 The way ahead to indexing
Indexing Example
 UIN (User Information Need) is
 What is the capital of Pakistan?
 Doc. 1: The capital of Pakistan is called Islamabad
 Doc. 2: Islamabad is the capital of Pakistan
 Doc. 3: The capitals of Pakistan and India are Islamabad
and Delhi, respectively
 Simple Query “Capital of Pakistan”
 Boolean Query “Capital AND Pakistan”
 Delivers Doc. 1 and Doc. 2 as results (Doc. 3 contains “capitals”, not “capital”, so it fails an exact term match)
 Naive approach
 Scanning and Text Matching
 Can we do this more efficiently?
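The naive approach can be sketched as a full scan over every document for every query. The tokenization rule (lowercase, punctuation dropped) and the function names are assumptions made for illustration; the three documents are the ones from the example.

```python
import re

DOCS = {
    1: "The capital of Pakistan is called Islamabad",
    2: "Islamabad is the capital of Pakistan",
    3: "The capitals of Pakistan and India are Islamabad and Delhi, respectively",
}


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def scan_and(docs, *query_terms):
    """Scan every document and keep those containing all query terms.

    Matching is on whole tokens, so "capital" does not match "capitals".
    """
    wanted = {t.lower() for t in query_terms}
    return [doc_id for doc_id, text in docs.items() if wanted <= set(tokenize(text))]
```

`scan_and(DOCS, "Capital", "Pakistan")` returns documents 1 and 2, but it re-reads the whole collection on every query, which is what an index avoids.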
Indexing Example (Cont…)
 Build a matrix / table / index with
 Columns = Documents
 Rows = All appearing words (alphabetically sorted)
 In Boolean Matching
 1 = Word appears in doc
 0 = Word does not appear
 Of the documents
 Doc. 1: The capital of Pakistan is called Islamabad
 Doc. 2: Islamabad is the capital of Pakistan
 Doc. 3: The capitals of Pakistan and India are called
Islamabad and Delhi, respectively
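The matrix described above can be built with a short sketch. Lowercasing and punctuation removal are assumptions, as are the function names; rows are terms in alphabetical order and columns are the documents as worded on this slide.

```python
import re

DOCS = {
    1: "The capital of Pakistan is called Islamabad",
    2: "Islamabad is the capital of Pakistan",
    3: "The capitals of Pakistan and India are called Islamabad and Delhi, respectively",
}


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def incidence_matrix(docs):
    """Return (doc_ids, matrix) where matrix[term] is a 0/1 row over doc_ids."""
    doc_ids = sorted(docs)
    doc_terms = {d: set(tokenize(docs[d])) for d in doc_ids}
    vocabulary = sorted(set().union(*doc_terms.values()))
    matrix = {t: [int(t in doc_terms[d]) for d in doc_ids] for t in vocabulary}
    return doc_ids, matrix


def boolean_and(doc_ids, matrix, *query_terms):
    """AND the rows of the query terms and return the matching document ids."""
    empty = [0] * len(doc_ids)
    rows = [matrix.get(t.lower(), empty) for t in query_terms]
    return [d for d, bits in zip(doc_ids, zip(*rows)) if all(bits)]
```

With this matrix, the Boolean query "Capital AND Pakistan" reduces to ANDing two rows of bits, with no rescanning of the document text.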
Indexing Example (Cont…)
[Figures not recovered]
Term-Document Incidence Matrix
 Search
 Very easy, but
 Storage
 Not very efficient
 For example,
 1,000,000 docs × 1,000 terms = matrix with 1,000,000,000
cells
 The Good news
 This matrix is very sparse (lots of 0’s, only few 1’s)
 Idea : Just store the ‘hits’ (term incidences)
 So, use the Inverted Index data structure
Inverted Index (Interchanging Columns & Rows)
Creating Inverted Index
 For each term “t”, store a list of all documents that contain “t”
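A minimal sketch of this idea: for each term, keep only the ids of the documents that contain it. The tokenization rule and the function names are assumptions; the merge-style `intersect` is a common way to answer an AND query over sorted lists and is added here for illustration, not taken from the slides.

```python
import re
from collections import defaultdict

DOCS = {
    1: "The capital of Pakistan is called Islamabad",
    2: "Islamabad is the capital of Pakistan",
    3: "The capitals of Pakistan and India are called Islamabad and Delhi, respectively",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


def intersect(p1, p2):
    """Merge two sorted postings lists, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out
```

For these three documents the index stores 24 document ids in total, compared with 39 cells (13 terms × 3 documents) in the full incidence matrix.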
Creating Index
 Documents are parsed to extract tokens
 Each document is assigned a unique document ID
[Figure: Doc 1 and Doc 2 example; image not recovered]
Sorting…
[Figure: Doc 1 and Doc 2 example; image not recovered]
Term Frequency
 Term Frequency means
 The number of occurrences of a term in a document
 Multiple term entries for a single document are merged
 Within-document term frequency information is
compiled
 A numeric value is generated
 For example,
 Compare ‘the’ in both files
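Merging repeated entries into a within-document frequency can be sketched as follows. The two documents here are hypothetical stand-ins, since the slide's own Doc 1 and Doc 2 texts were not recovered, and the function name is illustrative.

```python
import re
from collections import Counter

# Hypothetical documents, invented for illustration.
DOCS = {
    1: "the cat saw the dog",
    2: "the dog ran",
}


def term_frequencies(docs):
    """Return sorted (term, doc_id, tf) entries, one per term per document."""
    entries = []
    for doc_id, text in docs.items():
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        entries.extend((term, doc_id, tf) for term, tf in counts.items())
    return sorted(entries)
```

In this sketch the two occurrences of "the" in document 1 are merged into a single entry with frequency 2, while document 2 gets a separate entry with frequency 1.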
TF (Cont…)
[Figure not recovered]
Inverted Index
 An Inverted Index is split into two files
 Dictionary
 A vocabulary file containing all the terms from the
documents
 Posting File
 A postings file containing document entries with
frequencies
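The split into a dictionary and a postings file can be sketched as two in-memory structures. The layout used here (document frequency plus an offset into a flat postings list) is one plausible arrangement assumed for illustration, not the lecture's stated format; the sample entries and function names are hypothetical.

```python
# Hypothetical (term, doc_id, tf) entries, already sorted by term.
ENTRIES = [
    ("cat", 1, 1),
    ("dog", 1, 1),
    ("dog", 2, 1),
    ("ran", 2, 1),
    ("saw", 1, 1),
    ("the", 1, 2),
    ("the", 2, 1),
]


def split_index(entries):
    """Split sorted entries into a dictionary and a flat postings list.

    dictionary: term -> (document frequency, offset into postings)
    postings:   (doc_id, tf) pairs, contiguous for each term
    """
    dictionary = {}
    postings = []
    for term, doc_id, tf in entries:
        if term not in dictionary:
            dictionary[term] = [0, len(postings)]
        dictionary[term][0] += 1
        postings.append((doc_id, tf))
    return {t: tuple(v) for t, v in dictionary.items()}, postings


def lookup(dictionary, postings, term):
    """Return the (doc_id, tf) postings for a term, or [] if it is unknown."""
    df, offset = dictionary.get(term, (0, 0))
    return postings[offset:offset + df]
```

A lookup first consults the small dictionary and then reads only the slice of postings for that term, so the dictionary can stay in memory even when the postings are large.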
How can we improve?
