Inverted index
• Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.
• Structure of an inverted file:
◦ Vocabulary: the set of all distinct words in the text
◦ Occurrences: lists containing all the information necessary for each word of the vocabulary (text position, frequency, documents where the word appears, etc.)
• An inverted file index is a list of the terms that appear in the document collection (called a lexicon or vocabulary); for each term in the lexicon, it stores a list of pointers to all occurrences of that term in the document collection. This list is called an inverted list.
• The granularity of an index determines how accurately it represents the location of a word
◦ A coarse-grained index requires less storage but more query processing to eliminate false matches
◦ A word-level index enables queries involving adjacency and proximity, but has higher space requirements
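To illustrate why word-level granularity matters for adjacency and proximity queries, here is a minimal sketch. The postings dictionary and the helper name `within_distance` are hypothetical, and the positions are word offsets invented for this example.

```python
def within_distance(positions_a, positions_b, max_gap):
    """Return True if some occurrence in positions_b follows an
    occurrence in positions_a by at most max_gap words."""
    return any(0 < b - a <= max_gap for a in positions_a for b in positions_b)


# Hypothetical word-level postings for a single document (word offsets).
postings = {"inverted": [3, 17], "index": [4, 30]}

# Adjacency is proximity with a gap of exactly one word.
adjacent = within_distance(postings["inverted"], postings["index"], max_gap=1)
print(adjacent)
```

A coarse-grained index would only record that both words occur in the same block or document, so the candidate text would have to be scanned to confirm the match.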
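[Figure: structure of an inverted file. The vocabulary holds the indexed terms and their number of occurrences; the posting file holds the occurrences lists. Note on the figure: the vocabulary could be a tree-like structure.]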
• Text (the number above each word is its position in the text):
1    6     12  16 18      25  29     36  40   45       54  58      66  70
That house has a  garden. The garden has many flowers. The flowers are beautiful
• Inverted file:
Vocabulary   Occurrences
beautiful    70
flowers      45, 58
garden       18, 29
house        6
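The example above can be reproduced with a short sketch. It takes the (position, word) pairs exactly as labelled in the example rather than recomputing offsets. The stopword list is an assumption: it is simply the set of words that do not appear in the example's vocabulary.

```python
from collections import defaultdict

# (position, word) pairs as labelled in the example text.
TOKENS = [
    (1, "That"), (6, "house"), (12, "has"), (16, "a"), (18, "garden"),
    (25, "The"), (29, "garden"), (36, "has"), (40, "many"), (45, "flowers"),
    (54, "The"), (58, "flowers"), (66, "are"), (70, "beautiful"),
]

# Assumed stopword list: the words missing from the example's vocabulary.
STOPWORDS = {"that", "has", "a", "the", "many", "are"}


def build_inverted_file(tokens, stopwords):
    """Map each non-stopword term to the list of positions where it occurs."""
    occurrences = defaultdict(list)
    for position, word in tokens:
        term = word.lower()
        if term not in stopwords:
            occurrences[term].append(position)
    # Return the vocabulary in lexicographical order.
    return dict(sorted(occurrences.items()))


inverted_file = build_inverted_file(TOKENS, STOPWORDS)
for term, positions in inverted_file.items():
    print(term, positions)
```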
• The prior example allows for Boolean queries.
• The index also needs the document frequency and the term frequency:

Vocabulary entry    Posting file entry
k, d_k              doc_1, f_1k, doc_2, f_2k, …

d_k: document frequency of term k
doc_i: i-th document that contains term k
f_ik: term frequency of term k in document i
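As a rough sketch of this layout, the function below stores, for each term, its document frequency together with a posting list of (document, term frequency) pairs. The function name, the whitespace tokenisation and the two sample documents are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_frequency_index(docs):
    """Map each term to (document frequency, [(doc id, term frequency), ...])."""
    postings = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        # Count how often each term occurs in this document.
        for term, freq in sorted(Counter(text.lower().split()).items()):
            postings[term].append((doc_id, freq))
    # The document frequency is the length of the posting list.
    return {term: (len(plist), plist) for term, plist in postings.items()}


# Hypothetical two-document collection.
docs = {1: "garden house garden", 2: "garden flowers"}
index = build_frequency_index(docs)
print(index["garden"])
```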
• The space required for the vocabulary is rather small. According to Heaps' law, the vocabulary grows as O(n^β), where n is the size of the text and β is a constant between 0.4 and 0.6 in practice
◦ TREC-2: 1 GB text, 5 MB lexicon
• On the other hand, the occurrences demand much more space. Since each word appearing in the text is referenced once in that structure, the extra space is O(n)
• To reduce space requirements, a technique called block addressing is used
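To make the sublinear growth concrete, the sketch below evaluates Heaps' law in the form V = K · n^β. The constant K and the sample values are illustrative assumptions, not measurements from any collection.

```python
def heaps_vocabulary_size(n, k, beta):
    """Estimated vocabulary size for a text of n words under Heaps' law."""
    return k * n ** beta


def growth_factor(scale, beta):
    """Factor by which the vocabulary grows when the text grows by `scale`."""
    return scale ** beta


# Assumed constants, for illustration only.
estimate = heaps_vocabulary_size(n=10_000, k=10, beta=0.5)
print(estimate)

# With beta = 0.5, a text four times larger has only twice the vocabulary,
# while the occurrences, being O(n), grow fourfold.
print(growth_factor(4, 0.5))
```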
• The text is divided into blocks
• The occurrences point to the blocks where the word appears
• Advantages:
◦ the number of pointers is smaller than the number of positions
◦ all the occurrences of a word inside a single block are collapsed to one reference
• Disadvantages:
◦ an online search over the qualifying blocks is needed if exact positions are required
• Text (divided into four blocks, Block 1 to Block 4):
That house has a garden. The garden has many flowers. The flowers are beautiful
• Inverted file:
Vocabulary   Occurrences (block number)
beautiful    4
flowers      3
garden       2
house        1
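The block-addressing example above can be sketched as follows. The exact block boundaries are an assumption, chosen only to be consistent with the occurrences shown, and the stopword list is likewise assumed from the words absent from the vocabulary.

```python
from collections import defaultdict

# Assumed block boundaries, consistent with the occurrences in the example.
BLOCKS = [
    "That house has a",
    "garden. The garden has many",
    "flowers. The flowers are",
    "beautiful",
]
STOPWORDS = {"that", "has", "a", "the", "many", "are"}  # assumed


def normalise(word):
    """Lower-case a word and drop surrounding punctuation."""
    return word.strip(".,").lower()


def build_block_index(blocks, stopwords):
    """Map each term to the sorted block numbers (1-based) that contain it."""
    index = defaultdict(set)
    for block_number, block_text in enumerate(blocks, start=1):
        for raw in block_text.split():
            term = normalise(raw)
            if term and term not in stopwords:
                # Repeated occurrences in one block collapse to one reference.
                index[term].add(block_number)
    return {term: sorted(numbers) for term, numbers in sorted(index.items())}


def scan_block(block_text, term):
    """Online search: word offsets of `term` inside one qualifying block."""
    words = [normalise(w) for w in block_text.split()]
    return [i for i, w in enumerate(words) if w == term]


block_index = build_block_index(BLOCKS, STOPWORDS)
print(block_index)
# Exact positions require scanning the block that the index points to.
print(scan_block(BLOCKS[block_index["garden"][0] - 1], "garden"))
```

The two occurrences of "garden" cost a single pointer here, which is the space saving; recovering them needs the scan, which is the cost.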
• How big are inverted files?
◦ Sizes are given in relation to the original collection size
▪ in each cell, the right value is for an index that includes stopwords and the left value for one that removes stopwords
• Blocks require the text to be available in order to locate terms within blocks.

Index                   Small collection (1 MB)   Medium collection (200 MB)   Large collection (2 GB)
Addressing words        45% / 73%                 36% / 64%                    35% / 63%
Addressing 64K blocks   27% / 41%                 18% / 32%                    5% / 9%
Addressing 256 blocks   18% / 25%                 1.7% / 2.4%                  0.5% / 0.7%
• The search algorithm on an inverted index follows three steps:
1. Vocabulary search: the words present in the query are located in the vocabulary
2. Retrieval of occurrences: the lists of the occurrences of all query words found are retrieved
3. Manipulation of occurrences: the occurrences are processed to solve the query
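The three steps above can be sketched for one simple case, a conjunctive (AND) query over document identifiers. The choice of query type, the function name `search_all` and the sample index are assumptions for illustration.

```python
def search_all(index, query_words):
    """Return the sorted ids of documents that contain every query word."""
    if not query_words:
        return []

    # Step 1, vocabulary search: locate the query words in the vocabulary.
    found = [word for word in query_words if word in index]
    if len(found) < len(query_words):
        return []  # a missing word makes an AND query unsatisfiable

    # Step 2, retrieval of occurrences: fetch the list for each word found.
    lists = [index[word] for word in found]

    # Step 3, manipulation of occurrences: intersect the lists.
    result = set(lists[0])
    for occurrences in lists[1:]:
        result &= set(occurrences)
    return sorted(result)


# Hypothetical index mapping each term to the documents that contain it.
index = {"garden": [1, 2, 4], "flowers": [2, 4, 5], "house": [1]}
print(search_all(index, ["garden", "flowers"]))
```

Phrase or proximity queries would change only the third step, which would then compare positions instead of intersecting document ids.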
• Searching an inverted file starts with the vocabulary
◦ store the vocabulary in a separate file
• Structures used to store the vocabulary include
◦ Hashing: O(1) lookup, does not support range queries
◦ Tries: O(c) lookup, c = length(word)
◦ B-trees: O(log v) lookup, v = size of the vocabulary
• An alternative is simply storing the words in lexicographical order
◦ cheaper in space and very competitive, with O(log v) cost
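A vocabulary kept in lexicographical order can be searched with binary search, as in this minimal sketch built on Python's standard `bisect` module. The helper names `lookup` and `range_query` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def lookup(sorted_vocabulary, word):
    """Return the index of `word` in the sorted vocabulary, or None."""
    i = bisect_left(sorted_vocabulary, word)  # O(log v) comparisons
    if i < len(sorted_vocabulary) and sorted_vocabulary[i] == word:
        return i
    return None


def range_query(sorted_vocabulary, low, high):
    """Return all words w with low <= w <= high."""
    start = bisect_left(sorted_vocabulary, low)
    end = bisect_right(sorted_vocabulary, high)
    return sorted_vocabulary[start:end]


vocabulary = ["beautiful", "flowers", "garden", "house"]
print(lookup(vocabulary, "garden"))
print(range_query(vocabulary, "f", "h"))
```

Unlike a hash table, the sorted array also answers the range query, because neighbouring words are stored next to each other.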
• The whole vocabulary is kept in a suitable data structure that stores, for each word, a list of its occurrences
• Each word of each text in the corpus is read and searched for in the vocabulary
• If it is not found, it is added to the vocabulary with an empty list of occurrences
• The new position is added to the end of the word's list of occurrences
• Once the text is exhausted, the vocabulary is written to disk with the lists of occurrences.
• Two files are created:
◦ in the first file, each list of word occurrences is stored contiguously
◦ in the second file, the vocabulary is stored in lexicographical order and, for each word, a pointer to its list in the first file is also included. This allows the vocabulary to be kept in memory at search time
• The overall process takes O(n) worst-case time
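A minimal sketch of this construction follows, with in-memory byte strings standing in for the two files. The binary layout (little-endian 32-bit positions) and the (word, byte offset, count) form of the vocabulary entries are assumptions for illustration, not a format prescribed by the slides.

```python
import struct


def build_occurrences(tokens):
    """Read (position, word) pairs and collect the occurrences of each word."""
    occurrences = {}
    for position, word in tokens:
        if word not in occurrences:
            occurrences[word] = []  # new word: start with an empty list
        occurrences[word].append(position)
    return occurrences


def write_index(occurrences):
    """Return (postings_bytes, vocabulary).

    postings_bytes holds each occurrence list contiguously; vocabulary is a
    lexicographically sorted list of (word, byte_offset, count) entries.
    """
    postings = bytearray()
    vocabulary = []
    for word in sorted(occurrences):
        positions = occurrences[word]
        vocabulary.append((word, len(postings), len(positions)))
        postings += struct.pack(f"<{len(positions)}I", *positions)
    return bytes(postings), vocabulary


def read_occurrences(postings, offset, count):
    """Follow a vocabulary pointer into the postings data."""
    return list(struct.unpack_from(f"<{count}I", postings, offset))


tokens = [(6, "house"), (18, "garden"), (29, "garden"),
          (45, "flowers"), (58, "flowers"), (70, "beautiful")]
postings, vocabulary = write_index(build_occurrences(tokens))
word, offset, count = vocabulary[1]
print(word, read_occurrences(postings, offset, count))
```

Only the small vocabulary list needs to stay in memory at search time; each lookup then reads one contiguous slice of the postings.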
• An option is to use the previous algorithm until the main memory is exhausted. When no more memory is available, the partial index Ii obtained up to now is written to disk and erased from main memory before continuing with the rest of the text
• Once the text is exhausted, a number of partial indices Ii exist on disk
• The partial indices are merged to obtain the final index
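[Figure: binary merging of partial indices. The initial dumps I1 to I8 are merged pairwise into I1..2, I3..4, I5..6 and I7..8 (level 1), then into I1..4 and I5..8 (level 2), and finally into the final index I1..8 (level 3). The numbers 1 to 7 on the merge operations give a possible merging order.]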
• The total time to generate the partial indices is O(n)
• The number of partial indices is O(n/M), where M is the size of main memory
• Merging the O(n/M) partial indices requires log2(n/M) merging levels
• The total cost of this algorithm is O(n log(n/M))
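The level-by-level merging can be sketched as below. Partial indices are modelled as in-memory dictionaries instead of files on disk, and each one is assumed to cover a later part of the text than the one before it, so lists can simply be concatenated. Both points are simplifying assumptions.

```python
def merge_two(first, second):
    """Merge two partial indices (term -> sorted positions).

    Assumes `second` indexes text that comes after the text in `first`.
    """
    merged = {}
    for term in sorted(set(first) | set(second)):
        merged[term] = first.get(term, []) + second.get(term, [])
    return merged


def merge_all(partials):
    """Merge partial indices pairwise; return (final index, number of levels)."""
    if not partials:
        return {}, 0
    levels = 0
    while len(partials) > 1:
        next_level = []
        for i in range(0, len(partials), 2):
            if i + 1 < len(partials):
                next_level.append(merge_two(partials[i], partials[i + 1]))
            else:
                next_level.append(partials[i])  # odd one out moves up as is
        partials = next_level
        levels += 1
    return partials[0], levels


# Eight hypothetical partial indices, one occurrence of "t" in each.
partials = [{"t": [i]} for i in range(1, 9)]
final_index, levels = merge_all(partials)
print(final_index, levels)
```

Eight dumps need three levels, matching log2(8) = 3; each level touches every occurrence once, which is where the O(n log(n/M)) total comes from.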
• Inverted files are used to index text
• The indices are appropriate when the text collection is large and semi-static
• If the text collection is volatile, online searching is the only option
• Some techniques combine online and indexed searching
• Vocabulary List
◦ Text preprocessing modules
▪ lexical analysis, stemming, stopwords
• Occurrences of Vocabulary Terms
◦ Inverted index creation
▪ term frequency in documents, document frequency
• Retrieval and Ranking Algorithm
• Query and Ranking Interfaces
• Browsing/Visualization Interface

Inverted index

  • 1.
  • 2.  Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.  Structure of inverted file: ◦ Vocabulary: is the set of all distinct words in the text ◦ Occurrences: lists containing all information necessary for each word of the vocabulary (text position, frequency, documents where the word appears, etc.)
  • 3.  Inverted file index is list of terms that appear in the document collection (called a lexicon or vocabulary) and for each term in the lexicon, stores a list of pointers to all occurrences of that term in the document collection. This list is called an inverted list.  Granularity of an index determines the accuracy of representation of the location of the word ◦ Coarse-grained index requires less storage and more query processing to eliminate false matches ◦ Word-level index enables queries involving adjacency and proximity, but has higher space requirements
  • 5. Text (the number above each word is its starting position):

       1    6     12  16 18      25  29     36  40   45       54  58      66  70
       That house has a  garden. The garden has many flowers. The flowers are beautiful

       Inverted file:

       Vocabulary   Occurrences
       beautiful    70
       flowers      45, 58
       garden       18, 29
       house        6
  • 6. The prior example supports Boolean queries. For ranking, the index also needs the document frequency and the term frequency.

       Vocabulary entry   Posting file entry
       k   d_k            doc_1 f_1k   doc_2 f_2k   …

       d_k: document frequency of term k
       doc_i: i-th document that contains term k
       f_ik: term frequency of term k in document i
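
The entry layout above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `build_frequency_index`, the three sample documents, and the whitespace tokenization are hypothetical and do not come from the slides.

```python
from collections import Counter, defaultdict

def build_frequency_index(docs):
    """Map each term k to (d_k, [(doc_i, f_ik), ...])."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        # Naive tokenization (an assumption): lowercase and split on whitespace.
        term_counts = Counter(text.lower().split())
        for term, freq in term_counts.items():
            postings[term].append((doc_id, freq))
    # d_k is the number of documents in the posting list of term k.
    return {term: (len(plist), plist) for term, plist in postings.items()}

# Hypothetical mini collection, numbered doc 1..3.
docs = [
    "the garden has many flowers",
    "the flowers are beautiful flowers",
    "that house has a garden",
]
index = build_frequency_index(docs)
print(index["flowers"])  # (2, [(1, 1), (2, 2)])
print(index["garden"])   # (2, [(1, 1), (3, 1)])
```

Storing d_k next to the term lets a ranking function read the document frequency without walking the whole posting list.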
  • 7. The space required for the vocabulary is rather small. According to Heaps' law, the vocabulary grows as O(n^β), where n is the text size and β is a constant between 0.4 and 0.6 in practice.
       ◦ TREC-2: 1 GB text, 5 MB lexicon
       The occurrences, on the other hand, demand much more space. Since each word appearing in the text is referenced once in that structure, the extra space is O(n).
       To reduce space requirements, a technique called block addressing is used.
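
As a small worked consequence of Heaps' law, suppose the vocabulary size is written as V(n) = K n^β, where K is a collection-dependent constant (an assumption introduced here; the slide only gives the order of growth). Doubling the text then multiplies the vocabulary by 2^β:

```latex
V(n) = K n^{\beta}
\quad\Longrightarrow\quad
\frac{V(2n)}{V(n)} = 2^{\beta},
\qquad
2^{0.4} \approx 1.32, \quad 2^{0.6} \approx 1.52 .
```

Under that assumption, doubling the text adds roughly 32% to 52% more distinct words, while the O(n) occurrence lists double.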
  • 8. Block addressing: the text is divided into blocks, and the occurrences point to the blocks where the word appears.
       Advantages:
       ◦ pointers are smaller, because there are fewer blocks than positions
       ◦ all the occurrences of a word inside a single block are collapsed to one reference
       Disadvantages:
       ◦ if exact positions are required, an online search over the qualifying blocks is needed
  • 9. Text (divided into Block 1, Block 2, Block 3 and Block 4):

       That house has a garden. The garden has many flowers. The flowers are beautiful

       Inverted file:

       Vocabulary   Occurrences (block number)
       beautiful    4
       flowers      3
       garden       2
       house        1
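
Block addressing can be sketched as follows. The slide does not say where the block boundaries fall, so this sketch assumes blocks of four words each, a choice that happens to give the same block numbers as the example. The names `build_block_index` and `exact_positions` are hypothetical.

```python
import re

def build_block_index(text, words_per_block):
    """Return ({word: block numbers}, list of block strings)."""
    words = text.split()
    blocks = [
        " ".join(words[i:i + words_per_block])
        for i in range(0, len(words), words_per_block)
    ]
    index = {}
    for block_no, block in enumerate(blocks, start=1):
        # A set collapses repeated occurrences inside one block to one reference.
        for word in set(re.findall(r"\w+", block.lower())):
            index.setdefault(word, []).append(block_no)
    return index, blocks

def exact_positions(word, index, blocks):
    """Online search: scan only the qualifying blocks for offsets inside each block."""
    hits = []
    for block_no in index.get(word, []):
        block = blocks[block_no - 1].lower()
        for match in re.finditer(r"\b%s\b" % re.escape(word), block):
            hits.append((block_no, match.start()))
    return hits

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
index, blocks = build_block_index(text, words_per_block=4)
for word in ("beautiful", "flowers", "garden", "house"):
    print(word, index[word])
print(exact_positions("garden", index, blocks))  # [(2, 0), (2, 12)]
```

Both occurrences of "garden" collapse to the single reference to block 2, and recovering where they sit requires scanning that block, which is the trade-off listed on the previous slide.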
  • 10. How big are inverted files? Index size relative to the original collection size. In each pair of figures, the left one is for an index that removes stopwords and the right one for an index that includes them. Block addressing requires the text to be available in order to locate terms within blocks.

       Index                   Small collection (1 MB)   Medium collection (200 MB)   Large collection (2 GB)
       Addressing words        45%   73%                 36%   64%                    35%   63%
       Addressing 64K blocks   27%   41%                 18%   32%                    5%    9%
       Addressing 256 blocks   18%   25%                 1.7%  2.4%                   0.5%  0.7%
  • 11. The search algorithm on an inverted index follows three steps:
       1. Vocabulary search: the words present in the query are located in the vocabulary
       2. Retrieval of occurrences: the lists of occurrences of all the query words found are retrieved
       3. Manipulation of occurrences: the occurrences are processed to solve the query
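
The three steps can be sketched for a conjunctive (AND) query over a document-level index. This is a minimal illustration: the function `search_and`, the toy vocabulary, and the posting lists are hypothetical, and intersection is just one possible form of step 3.

```python
def search_and(query, vocabulary, postings):
    """Answer an AND query.

    vocabulary: dict mapping word -> key into `postings`
    postings:   dict mapping key -> sorted list of document ids
    """
    # Step 1: vocabulary search.
    keys = []
    for word in query.lower().split():
        if word not in vocabulary:
            return []  # a missing word makes the AND query empty
        keys.append(vocabulary[word])

    # Step 2: retrieval of occurrences.
    lists = [postings[key] for key in keys]
    if not lists:
        return []

    # Step 3: manipulation of occurrences (here: intersection).
    result = set(lists[0])
    for plist in lists[1:]:
        result &= set(plist)
    return sorted(result)

# Hypothetical toy index over three documents.
vocabulary = {"flowers": 0, "garden": 1, "house": 2}
postings = {0: [1, 2], 1: [1, 3], 2: [3]}

print(search_and("garden flowers", vocabulary, postings))  # [1]
print(search_and("house garden", vocabulary, postings))    # [3]
print(search_and("house roses", vocabulary, postings))     # []
```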
  • 12. Searching an inverted file starts with the vocabulary:
       ◦ store the vocabulary in a separate file
       Structures used to store the vocabulary include:
       ◦ Hashing: O(1) lookup, does not support range queries
       ◦ Tries: O(c) lookup, where c = length(word)
       ◦ B-trees: O(log v) lookup
       An alternative is simply to store the words in lexicographical order:
       ◦ cheaper in space and very competitive, with O(log v) cost
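
The lexicographically sorted alternative can be sketched with Python's standard `bisect` module. The helper names `lookup` and `prefix_range` and the sample word list are hypothetical. The prefix search uses "\uffff" as an upper sentinel, which assumes no vocabulary word contains that character.

```python
from bisect import bisect_left

def lookup(sorted_vocab, word):
    """Binary search in a sorted word list: O(log v) comparisons."""
    i = bisect_left(sorted_vocab, word)
    if i < len(sorted_vocab) and sorted_vocab[i] == word:
        return i
    return -1

def prefix_range(sorted_vocab, prefix):
    """All words starting with `prefix`, a range query that plain hashing does not support."""
    lo = bisect_left(sorted_vocab, prefix)
    hi = bisect_left(sorted_vocab, prefix + "\uffff")
    return sorted_vocab[lo:hi]

sorted_vocab = ["beautiful", "flower", "flowers", "garden", "house"]
print(lookup(sorted_vocab, "garden"))      # 3
print(lookup(sorted_vocab, "tree"))        # -1
print(prefix_range(sorted_vocab, "flow"))  # ['flower', 'flowers']
```

The index returned by `lookup` could then serve as the key to the word's occurrence list.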
  • 13. To build the index, the whole vocabulary is kept in a suitable data structure that stores, for each word, a list of its occurrences.
       Each word of each text in the corpus is read and searched for in the vocabulary.
       If it is not found, it is added to the vocabulary with an empty list of occurrences.
       The new position is then added to the end of the word's list of occurrences.
  • 14. Once the text is exhausted, the vocabulary is written to disk together with the lists of occurrences.
       Two files are created:
       ◦ in the first file, each list of word occurrences is stored contiguously
       ◦ in the second file, the vocabulary is stored in lexicographical order and, for each word, a pointer to its list in the first file is also included. This allows the vocabulary to be kept in memory at search time
       The overall process takes O(n) worst-case time.
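
The construction and the two-file layout can be sketched as below. This is an in-memory illustration under stated assumptions: a Python dict stands in for the vocabulary structure, a byte string and a list stand in for the two files, positions are stored as 4-byte little-endian integers, and the function names are hypothetical. Offsets are computed from the literal string, so they may differ by one from the figures printed on the earlier example slide.

```python
import re
import struct

def build_index(text):
    """Single pass: map each word to its 1-based character positions."""
    index = {}
    for match in re.finditer(r"\w+", text.lower()):
        # setdefault adds an unseen word with an empty list of occurrences.
        index.setdefault(match.group(), []).append(match.start() + 1)
    return index

def serialize(index):
    """Return (postings_bytes, vocabulary), mimicking the two files."""
    postings = bytearray()
    vocabulary = []  # entries: (word, byte offset of its list, list length)
    for word in sorted(index):  # lexicographical order
        positions = index[word]
        vocabulary.append((word, len(postings), len(positions)))
        postings += struct.pack("<%dI" % len(positions), *positions)
    return bytes(postings), vocabulary

def read_occurrences(postings, entry):
    """Follow a vocabulary pointer into the postings data."""
    _, offset, count = entry
    return list(struct.unpack_from("<%dI" % count, postings, offset))

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
index = build_index(text)
postings, vocabulary = serialize(index)
entry = next(e for e in vocabulary if e[0] == "flowers")
print(read_occurrences(postings, entry))  # [46, 59]
```

Because each list is written contiguously, one offset and one length per word are enough to find it again.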
  • 15. When the index does not fit in main memory, an option is to use the previous algorithm until main memory is exhausted. When no more memory is available, the partial index I_i obtained so far is written to disk and erased from main memory before continuing with the rest of the text.
       Once the text is exhausted, a number of partial indices I_i exist on disk.
       The partial indices are merged to obtain the final index.
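  • 16. [Figure: binary merging of partial indices. The initial dumps I_1 to I_8 are merged pairwise into I_1..2, I_3..4, I_5..6 and I_7..8 (level 1), then into I_1..4 and I_5..8 (level 2), and finally into I_1..8, the final index (level 3). The merges are numbered 1 to 7 in the figure.]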
  • 17. The total time to generate the partial indices is O(n).
       The number of partial indices is O(n/M), where M is the amount of main memory available.
       Merging the O(n/M) partial indices requires log_2(n/M) merging levels.
       The total cost of this algorithm is O(n log(n/M)).
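
The dump-and-merge scheme and its level count can be sketched in memory. Assumptions in this sketch: Python lists stand in for the partial indices on disk, `memory_limit` counts buffered occurrences and plays the role of M, positions are word ordinals, and all function names are hypothetical.

```python
import math

def build_partial_indices(tokens, memory_limit):
    """Dump a partial index each time `memory_limit` occurrences are buffered."""
    partials, current, used = [], {}, 0
    for position, word in enumerate(tokens, start=1):
        current.setdefault(word, []).append(position)
        used += 1
        if used == memory_limit:
            partials.append(current)  # stands in for writing I_i to disk
            current, used = {}, 0
    if current:
        partials.append(current)
    return partials

def merge_two(left, right):
    """Merge two partial indices; `left` covers earlier text than `right`."""
    merged = {}
    for word in sorted(set(left) | set(right)):
        # Concatenation keeps positions sorted because left precedes right.
        merged[word] = left.get(word, []) + right.get(word, [])
    return merged

def merge_all(partials):
    """Pairwise (binary) merging; returns (final index, number of levels)."""
    levels = 0
    while len(partials) > 1:
        partials = [
            merge_two(partials[i], partials[i + 1])
            if i + 1 < len(partials) else partials[i]
            for i in range(0, len(partials), 2)
        ]
        levels += 1
    return (partials[0] if partials else {}), levels

tokens = ("that house has a garden the garden has many flowers "
          "the flowers are beautiful").split()
partials = build_partial_indices(tokens, memory_limit=2)
final, levels = merge_all(partials)
print(len(partials), levels, math.ceil(math.log2(len(partials))))  # 7 3 3
print(final["garden"])  # [5, 7]
```

With 14 occurrences and room for 2 at a time, 7 partial indices are dumped and 3 merging levels are needed, in line with the log_2(n/M) bound.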
  • 18. Inverted files are used to index text.
       The indices are appropriate when the text collection is large and semi-static.
       If the text collection is volatile, online searching is the only option.
       Some techniques combine online and indexed searching.
  • 19. Vocabulary list
       ◦ Text preprocessing modules: lexical analysis, stemming, stopwords
       Occurrences of vocabulary terms
       ◦ Inverted index creation: term frequency in documents, document frequency
       Retrieval and ranking algorithm
       Query and ranking interfaces
       Browsing/visualization interface