Lucene Indexing
Presentation Transcript

  • LUCENE Indexing.....
  • What is a Search Index
    • A search index is a variant of a database
    • Similarities with traditional RDBMS
      • Needs to have fast lookup for keys
      • Bulk of data resides on secondary storage
      • Minimize secondary storage access
    • Differences (additions) compared to an RDBMS
      • No exhaustive result set; returns only the top-k ranked hits
      • Relaxed transactional requirements
      • Two-level processing required ( details later..)
  • Requirements of a Search Index
    • Minimize disk access at the expense of CPU
    • Indexing
      • Support for Dynamic Indexing
        • Relatively fast indexing without compromising search time.
        • Simultaneous search/indexing
      • Index Compression
        • Speeds up overall process by resulting in less I/O
    • Searching
      • Needs an inverted index for efficient search
      • Merge search results for different query terms
  • Indexing Strategies
    • Primarily requires a sorted list of <term, postings> entries
    • Possible Data Structures
      • B-Tree based – O(log_b N)
        • Requires random access, hence frequent disk seeks
      • Merge based – O(log_k N)
        • Sorts in-memory, then merge the sorted runs/segments
          • Creates new segments for newly added documents
          • Merge existing segments
      • Trie Based – O(term-length)
        • Requires random access, frequent disk seeks
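The merge-based strategy above can be sketched in a few lines: sort fixed-size runs in memory, then k-way merge the sorted runs sequentially (the runs stand in for on-disk segments, so the merge needs no random seeks). Names here (`index_terms`, `run_size`) are illustrative, not Lucene API:

```python
import heapq
import itertools

def index_terms(term_stream, run_size=4):
    """Merge-based indexing sketch: sort fixed-size runs in memory,
    then k-way merge the sorted runs. Each run models an on-disk
    segment read back sequentially during the merge."""
    it = iter(term_stream)
    runs = []
    while True:
        run = list(itertools.islice(it, run_size))
        if not run:
            break
        runs.append(sorted(run))      # in-memory sort of one run/segment
    return list(heapq.merge(*runs))   # sequential k-way merge

terms = ["lucene", "index", "merge", "btree", "trie", "segment", "doc", "field"]
print(index_terms(terms, run_size=3))
```

A real indexer spills each sorted run to disk before merging; the access pattern stays sequential either way, which is why this beats B-tree insertion for bulk indexing.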
  • Lucene Indexing Fundamentals
    • An index contains a collection of documents
    • A document is a collection of fields
    • A field is a named collection of terms
    • A term is a paired string <field,term-string>
    • Inverted Index for efficient term-based search
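The fundamentals above (index → documents → fields → terms, with a term being a ⟨field, token⟩ pair) can be shown with a toy inverted index. This is a sketch, not Lucene's implementation; the structure and names are illustrative:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Toy inverted index: a term is the pair (field, token), and each
    term maps to the sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for field, text in doc.items():          # a document is fields
            for token in text.lower().split():   # a field is terms
                index[(field, token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = [
    {"title": "lucene indexing", "body": "segments and fields"},
    {"title": "search basics",   "body": "lucene inverted index"},
]
idx = build_inverted_index(docs)
print(idx[("title", "lucene")])  # → [0]
```

Note that `("title", "lucene")` and `("body", "lucene")` are distinct terms, which is exactly why fields form independent search spaces.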
  • Some Indexing Terms
    • Fields ( Independent search space)
      • Unlike CPL, fields are defined at run-time
      • Field Types: stored, indexed, tokenized
        • A field can be both stored and indexed
        • An indexed field can be tokenized
    • Segments (sub indexes)
      • Independent searchable index
      • Lucene index evolves with segments getting added/merged
      • Search results from segments are merged
  • Merge Algorithm
      • Maintain a stack of Segment Indexes
      • Create a segment/index for every doc added
      • Push new indexes into a stack
      • for (size = 1; size < INF; size *= b) {
        • if (there exist b indexes of `size` docs each on top of the stack) {
          • pop them off the stack
          • merge them into a single index
          • push the merged index onto the stack
        • } else {
          • break
        • } }
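The stack-based merge above translates almost line for line into runnable code. In this sketch a "segment" is just a sorted list of doc ids (real segments hold full postings); `add_doc` is a hypothetical name:

```python
def add_doc(stack, doc, b=2):
    """Stack-based logarithmic merge: push a one-doc segment, then
    repeatedly merge b equal-sized segments off the top of the stack."""
    stack.append([doc])                     # one new segment per added doc
    size = 1
    while True:                             # for size = 1, b, b^2, ...
        top = stack[-b:]
        if len(top) == b and all(len(s) == size for s in top):
            merged = sorted(d for seg in top for d in seg)
            del stack[-b:]
            stack.append(merged)            # push the merged segment
            size *= b
        else:
            break
    return stack

stack = []
for doc in range(8):
    add_doc(stack, doc, b=2)
print([len(seg) for seg in stack])  # → [8]: eight docs collapse into one segment
```

With b = 2, segment sizes on the stack mirror the binary representation of the doc count (6 docs leave segments of sizes 4 and 2), which is where the logarithmic segment count comes from.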
  • Lucene Indexing Algorithm
    • K-way merge – process at disk transfer rate
    • Average b·log_b(N) indexes
      • n=1M,b=2 gives 20 indexes
      • Fast to update and not too slow to search
    • Optimization
      • Small sized segments kept in RAM, saves I/O calls
  • Search Algorithm
    • Uses Inverted Index
      • Data Structure : term->doc-entries postings
      • Use skip lists to speed up search
    • Merge postings from different query terms
      • Efficient relevancy score calculation
      • Use Vector Space Model for similarity calculation
      • Parallel merge algorithm
  • A Lucene Segment [diagram: segments-file layout: Format, Version, NameCounter, SegCount, then <SegName, SegSize> repeated SegCount times]
  • Per Index Files
      • Segments File (“segments” file)
        • Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize>^SegCount
      • Lock Files
        • Several files are used to indicate that another process is using an index. These are stored in java.io.tmpdir
      • Deletable Files
        • Only used on Win32, where a file may not be deleted while it is still open
      • Compound files (“.cfs” file, just a container)
        • Compound (.cfs) --> FileCount, <DataOffset, FileName>^FileCount, FileData^FileCount
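The compound-file layout above (FileCount, then per-file entries, then the data) can be mimicked with a toy packer/unpacker. This is a simplified sketch, not the real .cfs format: it uses fixed-width big-endian ints and length-prefixed names, while Lucene uses its own variable-length encodings:

```python
import struct

def pack_cfs(files):
    """Toy compound-file writer: FileCount, then (DataOffset, FileName)
    per file, then the concatenated file data."""
    names = list(files)
    header = struct.pack(">I", len(names))
    # Compute the header size first so offsets point into the data area.
    entries_size = sum(8 + 4 + len(n.encode()) for n in names)
    offset = len(header) + entries_size
    entries, data = b"", b""
    for name in names:
        blob = files[name]
        enc = name.encode()
        entries += struct.pack(">Q", offset) + struct.pack(">I", len(enc)) + enc
        data += blob
        offset += len(blob)
    return header + entries + data

def unpack_cfs(buf):
    (count,), pos = struct.unpack_from(">I", buf, 0), 4
    entries = []
    for _ in range(count):
        (off,) = struct.unpack_from(">Q", buf, pos); pos += 8
        (nlen,) = struct.unpack_from(">I", buf, pos); pos += 4
        name = buf[pos:pos + nlen].decode(); pos += nlen
        entries.append((name, off))
    entries.append((None, len(buf)))  # sentinel marking the last file's end
    return {entries[i][0]: buf[entries[i][1]:entries[i + 1][1]]
            for i in range(count)}

blob = pack_cfs({"_1.tis": b"termdict", "_1.frq": b"postings"})
print(unpack_cfs(blob))
```

The point of the container is the same as in Lucene: many logical per-segment files behind a single OS file handle.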
  • Per Segment Files
    • Each segment has the following :
      • Field names (e.g. Title, content etc.)
      • Stored Field Values ( e.g. Urls )
      • Term dictionary
        • A dictionary containing all of the terms used in all of the indexed fields of all of the documents. It also contains the number of documents which contain the term, and pointers to the term's frequency and proximity data.
      • Term Frequency data.
        • For each term in the dictionary, the numbers (ids) of all the documents that contain that term, and the frequency of the term in each of those documents.
  • FileFormat (contd..)
      • Term Proximity data
        • For each term in the dictionary, the positions that the term occurs in each document.
      • Term Vectors
        • For each field in each document, the term vector (sometimes called document vector) is stored. A term vector consists of term text and term frequency.
      • Normalization factors
        • For each field in each document, a value is stored that is multiplied into the score for hits on that field.
      • Deleted Documents
  • Per Segment Files(prefix=segname)
      • Fields
          • Field Info ( .fnm file)
          • Stored Fields (.fdx and .fdt files)
      • Term Dictionary
        • Term Infos (.tis file)
          • TermInfoFile (.tis)--> TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos
        • Term Info Index (.tii file)
          • This contains every IIth entry from the .tis file, along with its location in the .tis file. This is designed to be read entirely into memory and used to provide random access to the .tis file.
          • TermInfoIndex (.tii)--> TIVersion, IndexTermCount, IndexInterval, SkipInterval, TermIndices
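The .tii/.tis split described above is a classic sparse index: keep every interval-th term in memory, binary-search it, then scan at most `interval` entries of the full term list (the scan models a short sequential read of the .tis file). A sketch with hypothetical names:

```python
import bisect

class TermIndex:
    """Sketch of the .tii/.tis split: an in-memory sample of every
    interval-th term provides random access into the full sorted
    term list, which is then scanned sequentially."""
    def __init__(self, sorted_terms, interval=128):
        self.terms = sorted_terms                 # models the on-disk .tis
        self.interval = interval
        self.index = sorted_terms[::interval]     # models the in-memory .tii

    def lookup(self, term):
        # Find the last indexed term <= query, then scan forward.
        i = bisect.bisect_right(self.index, term) - 1
        if i < 0:
            return None
        start = i * self.interval
        for pos in range(start, min(start + self.interval, len(self.terms))):
            if self.terms[pos] == term:
                return pos
        return None

ti = TermIndex(["apple", "auto", "automata", "automate", "automatic", "zoo"],
               interval=2)
print(ti.lookup("automate"))  # → 3 (position in the full term list)
```

Memory cost is 1/interval of the dictionary, at the price of scanning up to `interval` entries per lookup, which is why the real format makes IndexInterval configurable.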
  • Per Segment Files (contd..)
      • Frequencies (.frq or the doc postings file)
        • Contains lists of documents which contain each term, along with the frequency of the term in that document.
        • Referenced from the .tis file as FreqData (delta encoded)
      • Positions (.prx file)
        • Contains list of term positions within the documents
        • Used for Span Query, Phrase Query searches
      • Normalization (.f[0-9]* files)
        • A norm file (.fN, where N is the field number) per indexed field, one byte per doc
  • Segment layout [diagram: the in-memory .tii file provides random seeks into the on-disk .tis file, which is read contiguously; each term entry holds Term, DocFreq, FreqDelta, ProxDelta, SkipDelta; FreqDelta points into the .frq/posting file (posting list of term freqs TF_1..TF_d, with skip data SD_1..SD_d/sk giving DocId, FreqSk, ProxSk) and ProxDelta into the .prx file; used to merge postings]
  • Index Compression
    • Variable byte encoding
      • Need very fast decoding (byte-align)
      • Given a binary representation
        • Split it into 7-bit groups (each byte carries values 0-127)
        • Set the high bit of a byte to signal that another byte follows; decoding reads bytes until the high bit is clear
    • Front Encoding of terms
        • Sorted terms commonly have long common prefix
          • 8,automata, 8,automate, 9,automatic, 10,automation
          • 0,automata, 7,e, 7,ic, 7,ion: store only (prefix-length, suffix) pairs (huge saving)
        • Works only if they are sorted lexicographically
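Both compression techniques above fit in a short sketch. The variable-byte codec is similar in spirit to Lucene's VInt (low 7 bits first, high bit as continuation flag); the front coder stores each term as a (shared-prefix-with-predecessor, suffix) pair, so the exact numbers can differ from the slide's example depending on which reference term the prefix is measured against:

```python
def vbyte_encode(n):
    """Variable-byte encode one non-negative int: 7 payload bits per
    byte, high bit set on every byte except the last (byte-aligned,
    which keeps decoding fast)."""
    out = []
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # continuation bit set
        n >>= 7
    out.append(n)
    return bytes(out)

def vbyte_decode(data):
    n, shift, values = 0, 0, []
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(n)
            n, shift = 0, 0
    return values

def front_encode(sorted_terms):
    """Front coding: each term stored as (prefix shared with the
    previous term, remaining suffix)."""
    prev, out = "", []
    for term in sorted_terms:
        p = 0
        while p < min(len(prev), len(term)) and prev[p] == term[p]:
            p += 1
        out.append((p, term[p:]))
        prev = term
    return out

print(vbyte_decode(vbyte_encode(300) + vbyte_encode(5)))  # → [300, 5]
print(front_encode(["automata", "automate", "automatic", "automation"]))
```

In practice the two combine: doc-id gaps (deltas) are small numbers, so variable-byte encoding of deltas is what makes posting files compact.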
  • Search Algorithm
    • Posting Lists (from .frq file)
      • For every term Ti, we have a posting list
        • Ti -> (doc-id, freq(d, Ti))*
        • Terms are sorted lexicographically, and each posting list is sorted by doc-id
      • Postings are delta encoded(for Index Compression)
    • Based on Vector Space Model
      • Conjunctive Search Algorithm (AND queries)
      • Disjunctive Search Algorithm (OR queries)
  • Vector Space Model
    • Terms constitute the space (or the axes)
    • Document/query are represented as a vector
    • Use cosine as a similarity measure (for docs)
    • Query is a short doc
    • TF-IDF weighting scheme
      • tf · log(n / n_t), where n is the total doc count and n_t the number of docs containing term t
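The model above, TF-IDF weights plus cosine similarity with the query treated as a short document, can be sketched directly (illustrative names; real scorers add smoothing and normalization factors):

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, df, n_docs):
    """Weight per term: tf * log(N / df_t), matching the scheme above.
    Base of the log and smoothing vary between implementations."""
    tf = Counter(doc_tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df.get(t)}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["lucene", "index"], ["lucene", "search"], ["vector", "space"]]
df = Counter(t for d in docs for t in set(d))        # document frequencies
vecs = [tfidf_vector(d, df, len(docs)) for d in docs]
query = tfidf_vector(["lucene"], df, len(docs))       # query as a short doc
scores = [cosine(query, v) for v in vecs]
print(scores)  # doc 2 shares no terms with the query, so it scores 0
```

Note how the idf factor log(n/n_t) down-weights "lucene" (in 2 of 3 docs) relative to the rarer terms, exactly the behavior the weighting scheme is after.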
  • Search Algorithm (contd.)
    • Conjunctive Search Algorithm
      • The AND merge
        • Walk through 2 postings simultaneously, O(m+n)
        • Crucial – postings sorted by doc-id
      • Use skip pointers to speed up merge
    • Disjunctive search Algorithm
        • All postings should be merged
        • Minimize per postings processing
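The two merges above are short to write down. The conjunctive (AND) merge walks two doc-id-sorted postings in lockstep in O(m+n), relying on the sort order the previous slide called crucial; skip pointers would let either walk jump ahead in blocks instead of advancing one entry at a time. The disjunctive (OR) merge combines all postings, here via a heap:

```python
import heapq

def and_merge(p1, p2):
    """Conjunctive merge: two-pointer walk over doc-id-sorted postings."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1       # a skip pointer could jump i forward in blocks
        else:
            j += 1
    return out

def or_merge(postings):
    """Disjunctive merge of any number of sorted postings via a heap,
    deduplicating doc ids (minimal per-posting work)."""
    out = []
    for d in heapq.merge(*postings):
        if not out or out[-1] != d:
            out.append(d)
    return out

print(and_merge([1, 3, 5, 8], [2, 3, 8, 9]))  # → [3, 8]
print(or_merge([[1, 3, 5], [2, 3, 8]]))       # → [1, 2, 3, 5, 8]
```

A scoring search keeps the term frequencies alongside the doc ids and accumulates a relevancy score at each match instead of just collecting ids.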
  • Others..
    • Positional Queries
      • Use positional information for doc/term tuple
    • Range Queries
      • Date Range queries
    • Stemming
      • Porter Stemmer Algorithm
    • Configurable Similarity algorithms
  • Open ended questions
    • Incremental Update to docs
      • Relevant to our focused crawl (for av files)
      • Hard problem to solve
    • Deficiencies in VSM
      • Need exact query terms
      • Use auto-classification/clustering
    • Q/A......