A dictionary containing all of the terms used in all of the indexed fields of all of the documents. It also contains the number of documents which contain the term, and pointers to the term's frequency and proximity data.
Term Frequency data.
For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in each of those documents.
Term Proximity data
For each term in the dictionary, the positions at which the term occurs in each document.
For each field in each document, the term vector (sometimes called document vector) is stored. A term vector consists of term text and term frequency.
For each field in each document, a value is stored that is multiplied into the score for hits on that field.
The .tii file contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. It is designed to be read entirely into memory and used to provide random access to the .tis file.
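The lookup this enables can be sketched as follows. This is a minimal in-memory illustration, not Lucene's code: the function names `build_sparse_index` and `lookup`, the list-based "dictionary", and the interval of 128 are all assumptions made for the example. A binary search over the small sampled index picks a starting point, then a short contiguous scan of the full dictionary finds the term.

```python
from bisect import bisect_right

def build_sparse_index(sorted_terms, index_interval=128):
    """Keep every index_interval-th term with its position in the full list.

    Hypothetical stand-in for the in-memory index; positions play the role
    of file offsets into the full term dictionary.
    """
    return [(sorted_terms[i], i) for i in range(0, len(sorted_terms), index_interval)]

def lookup(term, sorted_terms, sparse_index, index_interval=128):
    """Return the position of term in sorted_terms, or -1 if absent."""
    keys = [t for t, _ in sparse_index]
    # Last sampled term that is <= the query term.
    slot = bisect_right(keys, term) - 1
    if slot < 0:
        return -1  # query sorts before the first term
    start = sparse_index[slot][1]
    end = min(start + index_interval, len(sorted_terms))
    # Contiguous scan of at most index_interval entries.
    for pos in range(start, end):
        if sorted_terms[pos] == term:
            return pos
        if sorted_terms[pos] > term:
            break
    return -1

terms = sorted("term%04d" % i for i in range(1000))
index = build_sparse_index(terms, 128)
print(len(index), lookup("term0500", terms, index, 128))
```

A larger interval shrinks the in-memory index but lengthens the scan; a smaller one does the opposite.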
The .frq file contains, for each term, the list of documents that contain it, along with the frequency of the term in each document.
Referenced from the .tis file through FreqDelta (a delta-encoded pointer)
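Delta encoding stores differences between consecutive values instead of the values themselves, which keeps the numbers small when the underlying sequence is sorted. The sketch below applies the idea to a sorted list of document numbers; the helper names `to_gaps` and `from_gaps` are hypothetical, and the real .frq layout carries more than this (it also stores frequencies).

```python
def to_gaps(doc_ids):
    """Turn a strictly increasing list of doc ids into gaps (deltas)."""
    gaps = []
    prev = 0
    for doc_id in doc_ids:
        if doc_id < prev:
            raise ValueError("doc ids must be sorted")
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def from_gaps(gaps):
    """Rebuild the original doc ids by running-sum over the gaps."""
    doc_ids = []
    current = 0
    for gap in gaps:
        current += gap
        doc_ids.append(current)
    return doc_ids

postings = [3, 7, 8, 150, 151, 4000]
print(to_gaps(postings))
```

Small gaps pair well with a variable-length integer encoding, since most of them then fit in a single byte.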
Positions (.prx file)
Contains the list of term positions within each document
Used for Span Query, Phrase Query searches
Normalization (.f[0-9]* files)
One norm file (.fN, where N is the field number) per indexed field, holding one byte per document
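[Figure: Segment layout. The .tii file (in memory) gives random seeks into the .tis file (on disk), which is then read contiguously. Each .tis entry holds Term, DocFreq, FreqDelta, ProxDelta, SkipDelta. FreqDelta points into the .frq/posting file (on disk), whose posting list for a term holds DocId and term-frequency entries TF 1 ... TF d followed by skip data SD 1 ... SD d/sk (DocId, FreqSk, ProxSk); ProxDelta points into the .prx file. Used to merge postings.]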
Variable byte encoding
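Need very fast decoding (byte-aligned)
Given the binary representation of a number
Form groups of 7 bits each (i.e. 128 possible values per group), one group per byte
The 1st bit of each byte is a continuation flag; bits 2-8 carry the 7-bit group
When decoding, append bits 2-8 of the byte; if its 1st bit is 1, read the next byte and repeat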
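The scheme can be sketched in a few lines. The version below assumes the Lucene VInt convention: the least significant 7-bit group is written first, and the high bit of a byte is set when more bytes follow. The function names are hypothetical.

```python
def vbyte_encode(n):
    """Encode a non-negative integer as a variable-length byte string."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while n >= 0x80:
        # Low 7 bits as payload, high bit set: more bytes follow.
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)  # final byte, high bit clear
    return bytes(out)

def vbyte_decode(data, pos=0):
    """Decode one integer starting at data[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        b = data[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if (b & 0x80) == 0:
            return value, pos
        shift += 7

encoded = vbyte_encode(300)
print(encoded.hex(), vbyte_decode(encoded))
```

Values below 128 take one byte and values below 16384 take two, which is why the scheme works well on small numbers such as doc-id gaps and frequencies.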
Front Encoding of terms
Sorted terms commonly share long prefixes
automata, automate, automatic, automation -> 0,automata,7,e,7,ic,7,ion (huge saving)
Works only if they are sorted lexicographically
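A minimal sketch of front encoding, assuming each term is stored as (length of the prefix shared with the previous term, remaining suffix). The function names are hypothetical. Note that this encoder always takes the longest shared prefix, so for "automation" after "automatic" it emits 8,on rather than the 7,ion of the example above; both decode to the same term.

```python
def front_encode(sorted_terms):
    """Encode each term as (shared_prefix_len, suffix) against the previous term."""
    encoded = []
    prev = ""
    for term in sorted_terms:
        k = 0
        limit = min(len(prev), len(term))
        while k < limit and prev[k] == term[k]:
            k += 1
        encoded.append((k, term[k:]))
        prev = term
    return encoded

def front_decode(encoded):
    """Rebuild the terms; each entry needs the previously decoded term."""
    terms = []
    prev = ""
    for k, suffix in encoded:
        term = prev[:k] + suffix
        terms.append(term)
        prev = term
    return terms

words = ["automata", "automate", "automatic", "automation"]
print(front_encode(words))
print(front_decode([(0, "automata"), (7, "e"), (7, "ic"), (7, "ion")]))
```

Because every entry depends on the one before it, decoding must start from a fully stored term, which is one reason to keep periodic full entries for random access.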
Posting Lists (from .frq file)
For every term Ti, we have a posting list
Ti -> (Doc-id,Freq(d,t)) *
The terms Ti are sorted lexicographically; each posting list is sorted by doc-id
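Keeping each posting list sorted by doc-id is what makes a linear-time merge possible. The sketch below intersects two posting lists, as an AND query over two terms would; the function name `intersect` and the tuple layout are assumptions for illustration, and no skip data is used.

```python
def intersect(postings_a, postings_b):
    """Intersect two posting lists of (doc_id, freq) sorted by doc_id.

    Returns (doc_id, freq_a, freq_b) for documents containing both terms.
    """
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        doc_a, freq_a = postings_a[i]
        doc_b, freq_b = postings_b[j]
        if doc_a == doc_b:
            result.append((doc_a, freq_a, freq_b))
            i += 1
            j += 1
        elif doc_a < doc_b:
            i += 1  # advance the list that is behind
        else:
            j += 1
    return result

t1 = [(1, 2), (4, 1), (9, 5), (12, 1)]
t2 = [(4, 3), (5, 1), (12, 2), (20, 7)]
print(intersect(t1, t2))
```

Each list is read once from front to back, so the cost is proportional to the combined length of the two lists.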