SlideShare a Scribd company logo
1 of 49
Download to read offline
ChapterThree: Indexing
Structure
1
Introduction to
Information Storage and Retrieval
Indexing: Basic Concepts
 The usual unit for indexing is the word
 Index terms - are used to look up records in a file.
 Indexing is an arrangement of index terms to permit fast
searching and reading memory spaces
 used to speed up access to desired information from document collection as per
users query such that
 It enhances efficiency in terms of time for retrieval. Relevant documents are searched and
retrieved quick
 Index file usually has index terms in a sorted order. Which list is easier to search?
unsorted list:
or
sorted list:
 Index files are much smaller than the original file.
 Remember Heaps Law: in 1 GB of text collection the vocabulary might have a size of close
to 5 MB. This size may be further reduced by text operation
fox pig zebra hen ant cat dog lion ox
ant cat dog fox hen lion ox pig zebra
•2
Indexing: Basic Concepts
 How to retrieval information?
 A simple alternative is to search the whole text
sequentially (online search)
 Another option is to build data structures over
the text (called indices) to speed up the search
•3
Major Steps in Index Construction
Source file: Collection of text document
A document is a collection of words/terms and other informational elements
IndexTerms Selection: apply text operations or preprocessing
Tokenize: identify words in a document, so that each document is represented
by a list of keywords or attributes
Stop words removal: words with high frequency are non-content bearing and
needs to be removed from text collection
Stemming: reduce words with similar meaning into their stem/root word
Term weighting: Different index terms have varying relevance when used to
describe document contents. This effect is captured through the assignment of
numerical weights to each index term of a document.There are different index
terms weighting methods: including TF, IDF,TF*IDF, …
Indexing structure: a set of index terms (vocabulary) are organized in
Index File to easily identify documents in which each term occurs in.
•4
Basic Indexing Process
Linguistic
pre-processor
•5
Index file Evaluation Metrics
Running time of the main operations
Access/search time
 How much is the running time to find the required search key from the list?
Update time (Insertion time, Deletion time)
How much time it takes to update existing records in an attempt to:
 Add new terms,
 Delete existing unnecessary terms?
 Change the term-weight of an existing term
Does the index structure allows incremental update or re-indexing?
Space overhead
Computer storage space consumed for keeping the list.
•6
Building Index file
An index file of a document is a file consisting of a list of
index terms and a link to one or more documents that has
the index term
An index file is list of search terms that are organized for
associative look-up, i.e., to answer user’s query:
In which documents does a specific search term appear?
Where, within each document does each term appear? (There may
be several occurrences of a term in a document.)
For organizing index file for a collection of documents,
there are various options available:
Decide what data structure and/or file structure to use.
Is it sequential file, inverted file, suffix tree, etc. ?
•7
Sequential File
•Sequential file is the most primitive file structures.
• It has no vocabulary (unique list of words) as well as linking pointers.
•The records are generally arranged serially, one after another, but
in lexicographic order on the value of some key field.
• a particular attribute is chosen as primary key whose value will determine
the order of the records.
• when the first key fails to discriminate among records, a second key is
chosen to give an order.
•8
Example:
Given a collection of documents, they are parsed to extract
words and these are saved with the Document ID.
I did enact Julius
Caesar I was killed
I the Capitol;
Brutus killed me.
Doc 1
So let it be with
Caesar. The noble
Brutus has told you
Caesar was ambitious
Doc 2
•9
After all documents
have been tokenized,
stopwords are
removed, and
normalization and
stemming are
applied, then
generate index terms
These index terms in
sequential file are
sorted in alphabetical
order
Term Doc #
I 1
did 1
enact 1
julius 1
caesar 1
I 1
was 1
killed 1
I 1
the 1
capitol 1
brutus 1
killed 1
me 1
so 2
let 2
it 2
be 2
with 2
caesar 2
the 2
noble 2
brutus 2
hath 2
told 2
you 2
caesar 2
was 2
ambitious 2
Sorting the Vocabulary
Term Doc#
1 ambition 2
2 brutus 1
3 brutus 2
4 capitol 1
5 caesar 1
6 caesar 2
7 caesar 2
8 enact 1
9 julius 1
10 kill 1
11 kill 1
12 noble 2
Sequential file
•10
Sequential File
To access records we have to search serially;
starting at the first record read and investigate all the succeeding
records until the required record is found or end of the file is reached.
Its main advantages:
easy to implement;
provides fast access to the next record using lexicographic order.
Can be searched quickly,using binary search,O(log n)
 Question:“What is the Update options”
 Is the index needs to be rebuilt or incremental update is supported?
Its disadvantages:
No weights attached to terms. Individual words are treated
independently
Random access is slow: since similar terms are indexed individually, we
need to find all terms that match with the query
•11
Inverted file
A word oriented indexing mechanism based on sorted list of
keywords, with each keyword having links to the documents
containing it
 Building and maintaining an inverted index is a relatively low cost and low
risk.
 On a text of n words an inverted index can be built in O(n) time
 This list is inverted from a list of terms in location order to a list of terms in
alphabetical order.
Original
Documents
Document IDs
Word Extraction Word IDs
•W1:d1,d2,d3
•W2:d2,d4,d7,d9
•…
•Wn :di,…dn
•Inverted Files
•12
Inverted file
Data to be held in the inverted file includes
1) The vocabulary (List of terms): is the set of all
distinct words (index terms) in the text collection.
Having information about vocabulary (list of terms) speeds searching for
relevant documents
For each term: it contains information related to
i. Location:all the text locations/positions where the word occurs
ii. frequency of occurrence of terms in a document collection
TFij, number of occurrences of term tj in document di
DFj, number of documents containing tj
CF, total frequency of tj in the corpus n
mi, maximum frequency of a term in di
N, total number of documents in a collection
•13
Inverted file
Having information about the location of each term
within the document helps for:
user interface design: highlight location of search term
proximity based ranking: adjacency and near operators (in
Boolean searching)
Eg: which one is more relevant for a query‘artificial intelligence’
D1: the idea in artificial heart implantation is the intelligence in the field
of science.
D2: the artificial process of intelligent system
D3: the field of artificial intelligence is multi disciplinary.
Having information about frequency is used for:
calculating term weighting (likeTF,TF*IDF, …)
optimizing query processing
•14
Inverted File
Term CF Doc ID TF Location
term 1 3 2
19
29
1
1
1
66
213
45
term 2 4 3
19
22
1
2
1
94
7, 212
56
term 3 1 5 1 43
term 4 3 11
34
2
1
3, 70
40
This is called
an index file.
Text operations
are performed
before building
the index.
Documents are organized by the terms/words they contain
Is it possible to keep all these information during searching?
•15
Construction of Inverted file
An inverted index consists of two files: vocabulary and
posting files
A vocabulary file (Word list):
stores all of the distinct terms (keywords) that appear in any of the
documents (in lexicographical order) and
For each word a pointer to posting file
Records kept for each term j in the word list contains the
following:
term j
number of documents in which term j occurs (dfj)
Collection frequency of term j (its frequency in the whole corpus)
pointer to inverted (postings) list for term j
•16
Postings File (Inverted List)
For each distinct term in the vocabulary, stores a list of pointers to the
documents that contain that term.
Each element in an inverted list is called a posting, i.e., the
occurrence of a term in a document
It is stored as a separate inverted list for each column, i.e., a list
corresponding to each term in the index file.
Each list consists of one or many individual postings
Advantage of dividing inverted file:
Keeping a pointer in the vocabulary to the list in the posting
file allows:
the vocabulary to be kept in memory at search time even for large text
collection, and
Posting file to be kept on disk for accessing the documents to be given to
the user
•17
General structure of Inverted File
 The following figure shows the general structure of inverted
index file.
•18
Organization of Index File
Term DF TF
Pointer
To
posting
term 1 3 3
term 2 3 4
term 3 1 1
term 4 2 3
Inverted
lists
Vocabulary
(word list)
Postings
(inverted list)
Documents
•19
Example:
Given a collection of documents, they are parsed to extract
words and these are saved with the Document ID.
I did enact Julius
Caesar I was killed
I the Capitol;
Brutus killed me.
Doc 1
So let it be with
Caesar. The noble
Brutus has told you
Caesar was ambitious
Doc 2
•20
After all documents have
been tokenized the
inverted file is sorted by
terms
Term Doc #
ambitious 2
be 2
brutus 1
brutus 2
capitol 1
caesar 1
caesar 2
caesar 2
did 1
enact 1
has 1
I 1
I 1
I 1
it 2
julius 1
killed 1
killed 1
let 2
me 1
noble 2
so 2
the 1
the 2
told 2
you 2
was 1
was 2
with 2
Term Doc #
I 1
did 1
enact 1
julius 1
caesar 1
I 1
was 1
killed 1
I 1
the 1
capitol 1
brutus 1
killed 1
me 1
so 2
let 2
it 2
be 2
with 2
caesar 2
the 2
noble 2
brutus 2
hath 2
told 2
you 2
caesar 2
was 2
ambitious 2
Sorting the Vocabulary
•21
Multiple term
entries in a single
document are
merged and
frequency
information
added
Counting
number of
occurrence of
terms in the
collections helps
to computeTF
Term Doc # TF
ambition 2 1
brutus 1 1
brutus 2 1
capitol 1 1
caesar 1 1
caesar 2 2
enact 1 1
julius 1 1
kill 1 2
noble 2 1
Term Doc #
ambition 2
brutus 1
brutus 2
capitol 1
caesar 1
caesar 2
caesar 2
enact 1
julius 1
kill 1
kill 1
noble 2
Remove stop words, stemming & compute
frequency
•22
The file is commonly split into a Dictionary and a Posting file
Doc # TF
2 1
1 1
2 1
1 1
1 1
2 2
1 1
1 1
1 2
2 1
Term DF CF
ambitious 1 1
brutus 2 2
capitol 1 1
caesar 2 3
enact 1 1
julius 1 1
kill 1 2
noble 1 1
vocabulary
Pointers
Vocabulary and postings file
Term Doc # TF
ambition 2 1
brutus 1 1
brutus 2 1
capitol 1 1
caesar 1 1
caesar 2 2
enact 1 1
julius 1 1
kill 1 2
noble 2 1
posting
•23
Searching on Inverted File
 Since the whole index file is divided into two, searching can be
done faster by loading vocabulary list which takes less memory
even for large document collection
 Using binary Search the searching takes logarithmic time
 The search is in the vocabulary lists
 Updating inverted file is very complex.
 We need to update both vocabulary and posting files
•24
Example: Create Inverted file
 Map the file names to file IDs
 Consider the following Original Documents
Our staff have contributed intellectually and
professionally to the advancements in these fields.
The Department also produced its first PhD graduate in
1994.
Followed by the MSc in Computer Science which was
started in 1991.
The Department launched its first BSc in Computer
Studies in 1987.
The Department of Computer Science was established
in 1984.
D5
D4
D3
D2
D1
•25
Suffix trees
 Suffix tree takes text as one long string. No words.
 It helps to handle:
 Complex queries
 Compacted trie structure
 String indexing.
 Exact set matching problem.
 longest common substring.
 Frequent substring
 Problem: space
 If the query does not need exact matching, then suffix tree
would be the solution
 Find‘ssi’ in‘mississippi’
•26
Find ‘ssi’ in ‘mississippi’
The following suffix tree id developed to handle the word mississippi and
one can find any substring using this suffix tree.
•27
•What are suffix arrays and trees?
• Text indexing data structures
• not word based
• allow search for patterns or
• computation of statistics
•Important Properties
• Size
• Speed of exact matching
• Space required for construction
• Time required for construction
•28
•29
Suffix tree
What is Suffix?A suffix is a substring that exists at the end of the given
string.
 Each position in the text is considered as a text suffix
 If txt=t1t2...ti...tn is a string, thenTi=ti, ti+1...tn is the suffix of txt that starts at
position i, where 1≤ i ≤ n
Example: txt = mississippi txt = GOOGOL
T1 = mississippi; T1 = GOOGOL
T2 = ississippi; T2 = OOGOL
T3 = ssissippi; T3 = OGOL
T4 = sissippi; T4 = GOL
T5 = issippi; T5 = OL
T6 = ssippi; T6 = L
T7 = sippi;
T8 = ippi; Exercise: generate suffix of “technology” ?
T9 = ppi;
T10 = pi;
T11 = i;
•30
Suffix Tree
•A suffix Tree is an ordinary tree in which the input strings
are all possible suffixes.
–Principles: The idea behind suffix TRIE is to assign to each
symbol in a text an index corresponding to its position in the text.
(i.e: First symbol has index 1, last symbol has index n (#of symbols
in text).
• To build the suffix TRIE we use these indices instead of the
actual object.
•The structure has several advantages:
–It requires less storage space.
–We do not have to worry how the text is represented (binary,
ASCII, etc).
–We do not have to store the same object twice (no duplicate).
•31
Suffix Tree
•Construct suffix tree for the following string: GOOGOL
•We begin by giving a position to every suffix in the text starting from
left to right as per characters occurrence in the string.
• TEXT: G O O G O L $
POSITION: 1 2 3 4 5 6 7
•Build a SUFFIX TRIE for all n suffixes of the text.
•Note: The resulting tree has n leaves and height n.
• This structure is
particularly useful
for any application
requiring prefix
based ("starts with")
pattern matching.
•32
Suffix tree
A suffix tree is an extension of suffix
trie that construct aTrie of all the
proper suffixes of S
The suffix tree is created by
compacting unary nodes of the
suffixTRIE.
We store pointers rather than
words in the leaves.
It is also possible to replace strings
in every edge by a pair (a,b), where
a & b are the beginning and end
index of the string. i.e.
(3,7) for OGOL$
(1,2) for GO
(7,7) for $
•O
•33
Example: Suffix tree
•Let s=abab, a suffix tree of s is a compressed trie
of all suffixes of s=abab$
{
• $
• b$
• ab$
• bab$
• abab$
• }
•We label each leaf with the
starting point of the
corresponding suffix.
•$
•1
•2
•b
•3
•$ •4
•$
•5
•ab
•ab$
•ab$
•34
Search in suffix tree
Searching for all instances of a substring S in a suffix tree is easy since
any substring of S is the prefix of some suffix.
Pseudo-code for searching in suffix tree:
Start at root
Go down the tree by taking each time the corresponding path
If S correspond to a node then return all leaves in sub-tree
 the places where S can be found are given by the pointers in all the leaves in
the subtree rooted at x.
If S encountered a NIL pointer before reaching the end, then S is
not in the tree
Example: If S = "GO" we take the GO path and return:
GOOGOL$,GOL$.
If S = "OR" we take the O path and then we hit a NIL pointer so
"OR" is not in the tree.
•35
Exercise
 Given the following index terms:
worker, word, world, run & information
construct index file using suffix tree?
•36
Suffix Tree Applications
SuffixTree can be used to solve a large number of string problems
that occur in:
text-editing,
free-text search,
etc.
Some examples of string problems given below can easily be
managed by suffix tree.
String matching
Longest Common Substring
Longest Repeated Substring
Palindromes
etc..
•37
Complexity Analysis
 The suffix tree for a string has been built in O(n2) time.
 Searching is very fast:The search time is linear in the length of
string S.
 The number of leaves is n+1, where n is the number of input strings.
 Furthermore, in the leaves, we may store either the strings themselves or
pointers to the strings (that is, integers).
 Searching for a substring[1..m], in string[1..n], can be solved in
O(m) time.
 Expensive memory-wise
 Suffix trees consume a lot of space
 How many bytes required to store MISSISSIPI ?
•38
ReadingAssignment:
Signature files
•39
End of Chapter 3
•40
Suffix Tree Building Example for -
mississippi
Ukkonen’s Algorithm
•mississippi
•mississippi
•mississippi
Ukkonen’s Algorithm
•mississippi
•mississippi
Ukkonen’s Algorithm -
•mississippi
Ukkonen’s Algorithm -
•mississippi
Ukkonen’s Algorithm -
•mississippi
Ukkonen’s Algorithm -
•mississippi
Ukkonen’s Algorithm -
•mississippi
Ukkonen’s Algorithm -
•mississippi

More Related Content

What's hot

Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining ConceptsDung Nguyen
 
Data warehouse architecture
Data warehouse architecture Data warehouse architecture
Data warehouse architecture janani thirupathi
 
Negotiation Power Skills Applied in Library Services Management
Negotiation Power Skills Applied in Library Services ManagementNegotiation Power Skills Applied in Library Services Management
Negotiation Power Skills Applied in Library Services ManagementShirley Ingles-Cruz
 
Classaurus classification
Classaurus classificationClassaurus classification
Classaurus classificationavid
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data WarehouseSOMASUNDARAM T
 
greenstone digital library software
greenstone digital library softwaregreenstone digital library software
greenstone digital library softwaresharon bacalzo
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisDataminingTools Inc
 
1.1 library concepts, terms and systems edited
1.1 library concepts, terms and systems edited1.1 library concepts, terms and systems edited
1.1 library concepts, terms and systems editedChandraSekhar1115
 
Digital library technologies
Digital library technologies Digital library technologies
Digital library technologies Shriram Pandey
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval ssilambu111
 
Archives - DACS and EAD
Archives - DACS and EADArchives - DACS and EAD
Archives - DACS and EADsotrue
 
introduction to database
 introduction to database introduction to database
introduction to databaseAkif shexi
 
INFORMATION RETRIEVAL ‎AND DISSEMINATION
INFORMATION RETRIEVAL ‎AND DISSEMINATIONINFORMATION RETRIEVAL ‎AND DISSEMINATION
INFORMATION RETRIEVAL ‎AND DISSEMINATIONLibcorpio
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecturepcherukumalla
 
Z39.50: Information Retrieval protocol ppt
Z39.50: Information Retrieval protocol pptZ39.50: Information Retrieval protocol ppt
Z39.50: Information Retrieval protocol pptSUNILKUMARSINGH
 

What's hot (20)

Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 
Data warehouse architecture
Data warehouse architecture Data warehouse architecture
Data warehouse architecture
 
XML Signature
XML SignatureXML Signature
XML Signature
 
Negotiation Power Skills Applied in Library Services Management
Negotiation Power Skills Applied in Library Services ManagementNegotiation Power Skills Applied in Library Services Management
Negotiation Power Skills Applied in Library Services Management
 
Classaurus classification
Classaurus classificationClassaurus classification
Classaurus classification
 
Database Design
Database DesignDatabase Design
Database Design
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data Warehouse
 
greenstone digital library software
greenstone digital library softwaregreenstone digital library software
greenstone digital library software
 
Xml presentation
Xml presentationXml presentation
Xml presentation
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysis
 
1.1 library concepts, terms and systems edited
1.1 library concepts, terms and systems edited1.1 library concepts, terms and systems edited
1.1 library concepts, terms and systems edited
 
Digital library technologies
Digital library technologies Digital library technologies
Digital library technologies
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval s
 
Dspace software
Dspace softwareDspace software
Dspace software
 
Archives - DACS and EAD
Archives - DACS and EADArchives - DACS and EAD
Archives - DACS and EAD
 
introduction to database
 introduction to database introduction to database
introduction to database
 
INFORMATION RETRIEVAL ‎AND DISSEMINATION
INFORMATION RETRIEVAL ‎AND DISSEMINATIONINFORMATION RETRIEVAL ‎AND DISSEMINATION
INFORMATION RETRIEVAL ‎AND DISSEMINATION
 
Dublin core Presentation
Dublin core PresentationDublin core Presentation
Dublin core Presentation
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
 
Z39.50: Information Retrieval protocol ppt
Z39.50: Information Retrieval protocol pptZ39.50: Information Retrieval protocol ppt
Z39.50: Information Retrieval protocol ppt
 

Similar to Chapter 3 Indexing Structure.pdf

Chapter 3 Indexing.pdf
Chapter 3 Indexing.pdfChapter 3 Indexing.pdf
Chapter 3 Indexing.pdfHabtamu100
 
overview of storage and indexing BY-Pratik kadam
overview of storage and indexing BY-Pratik kadam overview of storage and indexing BY-Pratik kadam
overview of storage and indexing BY-Pratik kadam pratikkadam78
 
File organization and introduction of DBMS
File organization and introduction of DBMSFile organization and introduction of DBMS
File organization and introduction of DBMSVrushaliSolanke
 
fileorganizationandintroductionofdbms-210313163900.pdf
fileorganizationandintroductionofdbms-210313163900.pdffileorganizationandintroductionofdbms-210313163900.pdf
fileorganizationandintroductionofdbms-210313163900.pdfFraolUmeta
 
Csci12 report aug18
Csci12 report aug18Csci12 report aug18
Csci12 report aug18karenostil
 
File Structure.pptx
File Structure.pptxFile Structure.pptx
File Structure.pptxzedd15
 
chapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.pptchapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.pptSamuelKetema1
 
Application portfolio development.advadisadvan.pptx
Application portfolio development.advadisadvan.pptxApplication portfolio development.advadisadvan.pptx
Application portfolio development.advadisadvan.pptxAmanJain384694
 
MARUTHI_INVERTED_SEARCH_presentation.pptx
MARUTHI_INVERTED_SEARCH_presentation.pptxMARUTHI_INVERTED_SEARCH_presentation.pptx
MARUTHI_INVERTED_SEARCH_presentation.pptxMaruthiRock
 
Ch 17 disk storage, basic files structure, and hashing
Ch 17 disk storage, basic files structure, and hashingCh 17 disk storage, basic files structure, and hashing
Ch 17 disk storage, basic files structure, and hashingZainab Almugbel
 
358 33 powerpoint-slides_16-files-their-organization_chapter-16
358 33 powerpoint-slides_16-files-their-organization_chapter-16358 33 powerpoint-slides_16-files-their-organization_chapter-16
358 33 powerpoint-slides_16-files-their-organization_chapter-16sumitbardhan
 
Fundamental file structure concepts & managing files of records
Fundamental file structure concepts & managing files of recordsFundamental file structure concepts & managing files of records
Fundamental file structure concepts & managing files of recordsDevyani Vaidya
 
Database and Research Matrix.pptx
Database and Research Matrix.pptxDatabase and Research Matrix.pptx
Database and Research Matrix.pptxRahulRoshan37
 

Similar to Chapter 3 Indexing Structure.pdf (20)

Chapter 3 Indexing.pdf
Chapter 3 Indexing.pdfChapter 3 Indexing.pdf
Chapter 3 Indexing.pdf
 
3_Indexing.ppt
3_Indexing.ppt3_Indexing.ppt
3_Indexing.ppt
 
overview of storage and indexing BY-Pratik kadam
overview of storage and indexing BY-Pratik kadam overview of storage and indexing BY-Pratik kadam
overview of storage and indexing BY-Pratik kadam
 
File organization and introduction of DBMS
File organization and introduction of DBMSFile organization and introduction of DBMS
File organization and introduction of DBMS
 
fileorganizationandintroductionofdbms-210313163900.pdf
fileorganizationandintroductionofdbms-210313163900.pdffileorganizationandintroductionofdbms-210313163900.pdf
fileorganizationandintroductionofdbms-210313163900.pdf
 
Csci12 report aug18
Csci12 report aug18Csci12 report aug18
Csci12 report aug18
 
File Structure.pptx
File Structure.pptxFile Structure.pptx
File Structure.pptx
 
chapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.pptchapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.ppt
 
File organisation
File organisationFile organisation
File organisation
 
Application portfolio development.advadisadvan.pptx
Application portfolio development.advadisadvan.pptxApplication portfolio development.advadisadvan.pptx
Application portfolio development.advadisadvan.pptx
 
MARUTHI_INVERTED_SEARCH_presentation.pptx
MARUTHI_INVERTED_SEARCH_presentation.pptxMARUTHI_INVERTED_SEARCH_presentation.pptx
MARUTHI_INVERTED_SEARCH_presentation.pptx
 
Ch 17 disk storage, basic files structure, and hashing
Ch 17 disk storage, basic files structure, and hashingCh 17 disk storage, basic files structure, and hashing
Ch 17 disk storage, basic files structure, and hashing
 
Inverted index
Inverted indexInverted index
Inverted index
 
358 33 powerpoint-slides_16-files-their-organization_chapter-16
358 33 powerpoint-slides_16-files-their-organization_chapter-16358 33 powerpoint-slides_16-files-their-organization_chapter-16
358 33 powerpoint-slides_16-files-their-organization_chapter-16
 
Fundamental file structure concepts & managing files of records
Fundamental file structure concepts & managing files of recordsFundamental file structure concepts & managing files of records
Fundamental file structure concepts & managing files of records
 
Database and Research Matrix.pptx
Database and Research Matrix.pptxDatabase and Research Matrix.pptx
Database and Research Matrix.pptx
 
Lucene indexing
Lucene indexingLucene indexing
Lucene indexing
 
DBMS (UNIT 5)
DBMS (UNIT 5)DBMS (UNIT 5)
DBMS (UNIT 5)
 
File organisation
File organisationFile organisation
File organisation
 
File Management
File ManagementFile Management
File Management
 

Recently uploaded

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 

Recently uploaded (20)

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 

Chapter 3 Indexing Structure.pdf

  • 2. Indexing: Basic Concepts  The usual unit for indexing is the word  Index terms - are used to look up records in a file.  Indexing is an arrangement of index terms to permit fast searching and reading memory spaces  used to speed up access to desired information from document collection as per users query such that  It enhances efficiency in terms of time for retrieval. Relevant documents are searched and retrieved quick  Index file usually has index terms in a sorted order. Which list is easier to search? unsorted list: or sorted list:  Index files are much smaller than the original file.  Remember Heaps Law: in 1 GB of text collection the vocabulary might have a size of close to 5 MB. This size may be further reduced by text operation fox pig zebra hen ant cat dog lion ox ant cat dog fox hen lion ox pig zebra •2
  • 3. Indexing: Basic Concepts  How to retrieval information?  A simple alternative is to search the whole text sequentially (online search)  Another option is to build data structures over the text (called indices) to speed up the search •3
  • 4. Major Steps in Index Construction Source file: Collection of text document A document is a collection of words/terms and other informational elements IndexTerms Selection: apply text operations or preprocessing Tokenize: identify words in a document, so that each document is represented by a list of keywords or attributes Stop words removal: words with high frequency are non-content bearing and needs to be removed from text collection Stemming: reduce words with similar meaning into their stem/root word Term weighting: Different index terms have varying relevance when used to describe document contents. This effect is captured through the assignment of numerical weights to each index term of a document.There are different index terms weighting methods: including TF, IDF,TF*IDF, … Indexing structure: a set of index terms (vocabulary) are organized in Index File to easily identify documents in which each term occurs in. •4
  • 6. Index file Evaluation Metrics Running time of the main operations Access/search time  How much is the running time to find the required search key from the list? Update time (Insertion time, Deletion time) How much time it takes to update existing records in an attempt to:  Add new terms,  Delete existing unnecessary terms?  Change the term-weight of an existing term Does the index structure allows incremental update or re-indexing? Space overhead Computer storage space consumed for keeping the list. •6
  • 7. Building Index file An index file of a document is a file consisting of a list of index terms and a link to one or more documents that has the index term An index file is list of search terms that are organized for associative look-up, i.e., to answer user’s query: In which documents does a specific search term appear? Where, within each document does each term appear? (There may be several occurrences of a term in a document.) For organizing index file for a collection of documents, there are various options available: Decide what data structure and/or file structure to use. Is it sequential file, inverted file, suffix tree, etc. ? •7
  • 8. Sequential File •Sequential file is the most primitive file structures. • It has no vocabulary (unique list of words) as well as linking pointers. •The records are generally arranged serially, one after another, but in lexicographic order on the value of some key field. • a particular attribute is chosen as primary key whose value will determine the order of the records. • when the first key fails to discriminate among records, a second key is chosen to give an order. •8
  • 9. Example: Given a collection of documents, they are parsed to extract words and these are saved with the Document ID. I did enact Julius Caesar I was killed I the Capitol; Brutus killed me. Doc 1 So let it be with Caesar. The noble Brutus has told you Caesar was ambitious Doc 2 •9
  • 10. After all documents have been tokenized, stopwords are removed, and normalization and stemming are applied, then generate index terms These index terms in sequential file are sorted in alphabetical order Term Doc # I 1 did 1 enact 1 julius 1 caesar 1 I 1 was 1 killed 1 I 1 the 1 capitol 1 brutus 1 killed 1 me 1 so 2 let 2 it 2 be 2 with 2 caesar 2 the 2 noble 2 brutus 2 hath 2 told 2 you 2 caesar 2 was 2 ambitious 2 Sorting the Vocabulary Term Doc# 1 ambition 2 2 brutus 1 3 brutus 2 4 capitol 1 5 caesar 1 6 caesar 2 7 caesar 2 8 enact 1 9 julius 1 10 kill 1 11 kill 1 12 noble 2 Sequential file •10
  • 11. Sequential File To access records we have to search serially; starting at the first record read and investigate all the succeeding records until the required record is found or end of the file is reached. Its main advantages: easy to implement; provides fast access to the next record using lexicographic order. Can be searched quickly,using binary search,O(log n)  Question:“What is the Update options”  Is the index needs to be rebuilt or incremental update is supported? Its disadvantages: No weights attached to terms. Individual words are treated independently Random access is slow: since similar terms are indexed individually, we need to find all terms that match with the query •11
  • 12. Inverted file A word oriented indexing mechanism based on sorted list of keywords, with each keyword having links to the documents containing it  Building and maintaining an inverted index is a relatively low cost and low risk.  On a text of n words an inverted index can be built in O(n) time  This list is inverted from a list of terms in location order to a list of terms in alphabetical order. Original Documents Document IDs Word Extraction Word IDs •W1:d1,d2,d3 •W2:d2,d4,d7,d9 •… •Wn :di,…dn •Inverted Files •12
  • 13. Inverted file Data to be held in the inverted file includes 1) The vocabulary (List of terms): is the set of all distinct words (index terms) in the text collection. Having information about vocabulary (list of terms) speeds searching for relevant documents For each term: it contains information related to i. Location:all the text locations/positions where the word occurs ii. frequency of occurrence of terms in a document collection TFij, number of occurrences of term tj in document di DFj, number of documents containing tj CF, total frequency of tj in the corpus n mi, maximum frequency of a term in di N, total number of documents in a collection •13
  • 14. Inverted file Having information about the location of each term within the document helps for: user interface design: highlight location of search term proximity based ranking: adjacency and near operators (in Boolean searching) Eg: which one is more relevant for a query‘artificial intelligence’ D1: the idea in artificial heart implantation is the intelligence in the field of science. D2: the artificial process of intelligent system D3: the field of artificial intelligence is multi disciplinary. Having information about frequency is used for: calculating term weighting (likeTF,TF*IDF, …) optimizing query processing •14
  • 15. Inverted File Term CF Doc ID TF Location term 1 3 2 19 29 1 1 1 66 213 45 term 2 4 3 19 22 1 2 1 94 7, 212 56 term 3 1 5 1 43 term 4 3 11 34 2 1 3, 70 40 This is called an index file. Text operations are performed before building the index. Documents are organized by the terms/words they contain Is it possible to keep all these information during searching? •15
  • 16. Construction of Inverted file An inverted index consists of two files: vocabulary and posting files A vocabulary file (Word list): stores all of the distinct terms (keywords) that appear in any of the documents (in lexicographical order) and For each word a pointer to posting file Records kept for each term j in the word list contains the following: term j number of documents in which term j occurs (dfj) Collection frequency of term j (its frequency in the whole corpus) pointer to inverted (postings) list for term j •16
  • 17. Postings File (Inverted List) For each distinct term in the vocabulary, stores a list of pointers to the documents that contain that term. Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document It is stored as a separate inverted list for each column, i.e., a list corresponding to each term in the index file. Each list consists of one or many individual postings Advantage of dividing inverted file: Keeping a pointer in the vocabulary to the list in the posting file allows: the vocabulary to be kept in memory at search time even for large text collection, and Posting file to be kept on disk for accessing the documents to be given to the user •17
  • 18. General structure of Inverted File  The following figure shows the general structure of inverted index file. •18
  • 19. Organization of Index File Term DF TF Pointer To posting term 1 3 3 term 2 3 4 term 3 1 1 term 4 2 3 Inverted lists Vocabulary (word list) Postings (inverted list) Documents •19
  • 20. Example: Given a collection of documents, they are parsed to extract words and these are saved with the Document ID. I did enact Julius Caesar I was killed I the Capitol; Brutus killed me. Doc 1 So let it be with Caesar. The noble Brutus has told you Caesar was ambitious Doc 2 •20
  • 21. After all documents have been tokenized the inverted file is sorted by terms Term Doc # ambitious 2 be 2 brutus 1 brutus 2 capitol 1 caesar 1 caesar 2 caesar 2 did 1 enact 1 has 1 I 1 I 1 I 1 it 2 julius 1 killed 1 killed 1 let 2 me 1 noble 2 so 2 the 1 the 2 told 2 you 2 was 1 was 2 with 2 Term Doc # I 1 did 1 enact 1 julius 1 caesar 1 I 1 was 1 killed 1 I 1 the 1 capitol 1 brutus 1 killed 1 me 1 so 2 let 2 it 2 be 2 with 2 caesar 2 the 2 noble 2 brutus 2 hath 2 told 2 you 2 caesar 2 was 2 ambitious 2 Sorting the Vocabulary •21
  • 22. Multiple term entries in a single document are merged and frequency information added Counting number of occurrence of terms in the collections helps to computeTF Term Doc # TF ambition 2 1 brutus 1 1 brutus 2 1 capitol 1 1 caesar 1 1 caesar 2 2 enact 1 1 julius 1 1 kill 1 2 noble 2 1 Term Doc # ambition 2 brutus 1 brutus 2 capitol 1 caesar 1 caesar 2 caesar 2 enact 1 julius 1 kill 1 kill 1 noble 2 Remove stop words, stemming & compute frequency •22
  • 23. The file is commonly split into a Dictionary and a Posting file Doc # TF 2 1 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1 2 2 1 Term DF CF ambitious 1 1 brutus 2 2 capitol 1 1 caesar 2 3 enact 1 1 julius 1 1 kill 1 2 noble 1 1 vocabulary Pointers Vocabulary and postings file Term Doc # TF ambition 2 1 brutus 1 1 brutus 2 1 capitol 1 1 caesar 1 1 caesar 2 2 enact 1 1 julius 1 1 kill 1 2 noble 2 1 posting •23
  • 24. Searching on Inverted File  Since the whole index file is divided into two, searching can be done faster by loading vocabulary list which takes less memory even for large document collection  Using binary Search the searching takes logarithmic time  The search is in the vocabulary lists  Updating inverted file is very complex.  We need to update both vocabulary and posting files •24
  • 25. Example: Create Inverted file  Map the file names to file IDs  Consider the following Original Documents Our staff have contributed intellectually and professionally to the advancements in these fields. The Department also produced its first PhD graduate in 1994. Followed by the MSc in Computer Science which was started in 1991. The Department launched its first BSc in Computer Studies in 1987. The Department of Computer Science was established in 1984. D5 D4 D3 D2 D1 •25
  • 26. Suffix trees  Suffix tree takes text as one long string. No words.  It helps to handle:  Complex queries  Compacted trie structure  String indexing.  Exact set matching problem.  longest common substring.  Frequent substring  Problem: space  If the query does not need exact matching, then suffix tree would be the solution  Find‘ssi’ in‘mississippi’ •26
  • 27. Find ‘ssi’ in ‘mississippi’ The following suffix tree id developed to handle the word mississippi and one can find any substring using this suffix tree. •27
  • 28. •What are suffix arrays and trees? • Text indexing data structures • not word based • allow search for patterns or • computation of statistics •Important Properties • Size • Speed of exact matching • Space required for construction • Time required for construction •28
  • 29. •29
  • 30. Suffix tree What is Suffix?A suffix is a substring that exists at the end of the given string.  Each position in the text is considered as a text suffix  If txt=t1t2...ti...tn is a string, thenTi=ti, ti+1...tn is the suffix of txt that starts at position i, where 1≤ i ≤ n Example: txt = mississippi txt = GOOGOL T1 = mississippi; T1 = GOOGOL T2 = ississippi; T2 = OOGOL T3 = ssissippi; T3 = OGOL T4 = sissippi; T4 = GOL T5 = issippi; T5 = OL T6 = ssippi; T6 = L T7 = sippi; T8 = ippi; Exercise: generate suffix of “technology” ? T9 = ppi; T10 = pi; T11 = i; •30
  • 31. Suffix Tree •A suffix Tree is an ordinary tree in which the input strings are all possible suffixes. –Principles: The idea behind suffix TRIE is to assign to each symbol in a text an index corresponding to its position in the text. (i.e: First symbol has index 1, last symbol has index n (#of symbols in text). • To build the suffix TRIE we use these indices instead of the actual object. •The structure has several advantages: –It requires less storage space. –We do not have to worry how the text is represented (binary, ASCII, etc). –We do not have to store the same object twice (no duplicate). •31
  • 32. Suffix Tree •Construct suffix tree for the following string: GOOGOL •We begin by giving a position to every suffix in the text starting from left to right as per characters occurrence in the string. • TEXT: G O O G O L $ POSITION: 1 2 3 4 5 6 7 •Build a SUFFIX TRIE for all n suffixes of the text. •Note: The resulting tree has n leaves and height n. • This structure is particularly useful for any application requiring prefix based ("starts with") pattern matching. •32
  • 33. Suffix tree A suffix tree is an extension of suffix trie that construct aTrie of all the proper suffixes of S The suffix tree is created by compacting unary nodes of the suffixTRIE. We store pointers rather than words in the leaves. It is also possible to replace strings in every edge by a pair (a,b), where a & b are the beginning and end index of the string. i.e. (3,7) for OGOL$ (1,2) for GO (7,7) for $ •O •33
  • 34. Example: Suffix tree •Let s=abab, a suffix tree of s is a compressed trie of all suffixes of s=abab$ { • $ • b$ • ab$ • bab$ • abab$ • } •We label each leaf with the starting point of the corresponding suffix. •$ •1 •2 •b •3 •$ •4 •$ •5 •ab •ab$ •ab$ •34
  • 35. Search in suffix tree Searching for all instances of a substring S in a suffix tree is easy since any substring of S is the prefix of some suffix. Pseudo-code for searching in suffix tree: Start at root Go down the tree by taking each time the corresponding path If S correspond to a node then return all leaves in sub-tree  the places where S can be found are given by the pointers in all the leaves in the subtree rooted at x. If S encountered a NIL pointer before reaching the end, then S is not in the tree Example: If S = "GO" we take the GO path and return: GOOGOL$,GOL$. If S = "OR" we take the O path and then we hit a NIL pointer so "OR" is not in the tree. •35
  • 36. Exercise  Given the following index terms: worker, word, world, run & information construct index file using suffix tree? •36
  • 37. Suffix Tree Applications SuffixTree can be used to solve a large number of string problems that occur in: text-editing, free-text search, etc. Some examples of string problems given below can easily be managed by suffix tree. String matching Longest Common Substring Longest Repeated Substring Palindromes etc.. •37
  • 38. Complexity Analysis  The suffix tree for a string has been built in O(n2) time.  Searching is very fast:The search time is linear in the length of string S.  The number of leaves is n+1, where n is the number of input strings.  Furthermore, in the leaves, we may store either the strings themselves or pointers to the strings (that is, integers).  Searching for a substring[1..m], in string[1..n], can be solved in O(m) time.  Expensive memory-wise  Suffix trees consume a lot of space  How many bytes required to store MISSISSIPI ? •38
  • 40. End of Chapter 3 •40
  • 41. Suffix Tree Building Example for - mississippi