Information Retrieval with Open Source

Tomasz Korzeniowski
tomek@polarrose.com

Retrieval strategies
• Vector Space Model
• Latent Semantic Indexing
• Probabilistic Retrieval Strategies
• Language Models
• Inference Networks
• Extended Boolean Retrieval
• Neural Networks
• Genetic Algorithms
• Fuzzy Set Retrieval

http://tartarus.org/~martin/PorterStemmer/

Let N be the total number of documents in a corpus and df_t the document frequency of a term t, i.e. the number of documents that contain t. The inverse document frequency (idf) of a term t is defined as follows:

$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$

The idf of a rare term is high, whereas the idf of a frequent term is likely to be low.

The tf-idf weighting scheme assigns to term t a weight in document d given by

$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$

[Figure 7.1: Cosine similarity illustrated. The diagram shows the query vector v(q) and document vectors such as v(d2).]
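The cosine measure named in the figure caption can be sketched in a few lines of Python. This is only an illustration: the two-dimensional vectors below are hypothetical values chosen to show that vector length does not affect the result, and returning 0.0 for an all-zero vector is a convention assumed here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical term weights, not taken from the slides.
q = [1.0, 1.0]
d1 = [2.0, 2.0]  # same direction as q, twice the length
d2 = [1.0, 0.0]  # 45 degrees away from q

print(cosine_similarity(q, d1))
print(cosine_similarity(q, d2))
```

Because the dot product is divided by both vector lengths, a long document is not favoured merely for containing more words.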

Q: “gold silver truck”

D1: “Shipment of gold damaged in a fire”

D2: “Delivery of silver arrived in a silver truck”

D3: “Shipment of gold arrived in a truck”

TF (term frequency)

     a  arrived  damaged  delivery  fire  gold  in  of  shipment  silver  truck
D1   1  0        1        0         1     1     1   1   1         0       0
D2   1  1        0        1         0     0     1   1   0         2       1
D3   1  1        0        0         0     1     1   1   1         0       1
Q    0  0        0        0         0     1     0   0   0         1       1

$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$, with N = 3. In this example logarithms are to the base 10.

• a: log 3/3 = 0
• arrived: log 3/2 = 0.176
• damaged: log 3/1 = 0.477
• delivery: log 3/1 = 0.477
• fire: log 3/1 = 0.477
• gold: log 3/2 = 0.176
• in: log 3/3 = 0
• of: log 3/3 = 0
• shipment: log 3/2 = 0.176
• silver: log 3/1 = 0.477
• truck: log 3/2 = 0.176

tf-idf weights

     a  arrived  damaged  delivery  fire   gold   in  of  shipment  silver  truck
D1   0  0        0.477    0         0.477  0.176  0   0   0.176     0       0
D2   0  0.176    0        0.477     0      0      0   0   0         0.954   0.176
D3   0  0.176    0        0         0      0.176  0   0   0.176     0       0.176
Q    0  0        0        0         0      0.176  0   0   0         0.477   0.176

SC(Q,D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) + (0)(0.176) + (0.477)(0) + (0.176)(0) = (0.176)(0.176) ≈ 0.031

SC(Q,D2) = (0.954)(0.477) + (0.176)(0.176) ≈ 0.486

SC(Q,D3) = (0.176)(0.176) + (0.176)(0.176) ≈ 0.062
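The worked example can be reproduced with a short Python sketch. It assumes lowercasing and whitespace tokenization, base-10 logarithms, and the unnormalized dot product used for SC above; the function names are illustrative only.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "gold silver truck"


def tokenize(text):
    # Assumption: lowercase and split on whitespace, no stemming.
    return text.lower().split()


term_counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
n_docs = len(docs)
vocabulary = sorted(set().union(*term_counts.values()))

# df_t: number of documents containing t; idf_t = log10(N / df_t).
df = {t: sum(1 for counts in term_counts.values() if t in counts) for t in vocabulary}
idf = {t: math.log10(n_docs / df[t]) for t in vocabulary}


def weights(counts):
    """tf-idf weight per term; terms outside the corpus vocabulary are ignored."""
    return {t: tf * idf[t] for t, tf in counts.items() if t in idf}


def score(query_weights, doc_weights):
    """Unnormalized dot product between query and document weights."""
    return sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())


q_weights = weights(Counter(tokenize(query)))
scores = {name: score(q_weights, weights(c)) for name, c in term_counts.items()}

for name, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(s, 3))
```

Rounded to three decimals, the scores agree with the hand calculation and rank the documents D2, D3, D1.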

term-1: (dn, 1) (d10, 1)
term-2: (dn, 5) (dn, 3)
term-3: (d2, 11) (d10, 1)
term-4: (dn, 1) (d2, 1)
term-5: (dn, 2) (d4, 3)
term-n: (d6, 1) (d7, 3)
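The term-to-postings layout above, where each term maps to (document, term frequency) pairs, can be sketched as a small in-memory inverted index. This is a minimal illustration that reuses the three example documents and assumes whitespace tokenization; `build_inverted_index` is a hypothetical helper, not a Lucene or Ferret call.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, term_frequency) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term].append((doc_id, tf))
    return dict(index)


docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}

index = build_inverted_index(docs)
print(index["silver"])
print(index["gold"])
```

A query then only needs to read the postings lists of its own terms instead of scanning every document.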

Lucene includes several built-in analyzers. The primary ones are shown in table 4.2.
We’ll leave discussion of the two language-specific analyzers, RussianAnalyzer
and GermanAnalyzer, to section 4.8.2 and the special per-field analyzer wrapper,
PerFieldAnalyzerWrapper, to section 4.4.

Table 4.2 Primary analyzers available in Lucene

Analyzer             Steps taken

WhitespaceAnalyzer   Splits tokens at whitespace

SimpleAnalyzer       Divides text at nonletter characters and lowercases

StopAnalyzer         Divides text at nonletter characters, lowercases, and removes stop words

StandardAnalyzer     Tokenizes based on a sophisticated grammar that recognizes e-mail
                     addresses, acronyms, Chinese-Japanese-Korean characters,
                     alphanumerics, and more; lowercases; and removes stop words

The built-in analyzers we discuss in this section (WhitespaceAnalyzer, SimpleAnalyzer,
StopAnalyzer, and StandardAnalyzer) are designed to work with text in
almost any Western (European-based) language. You can see the effect of each of
these analyzers in the output in section 4.2.3. WhitespaceAnalyzer and SimpleAnalyzer
are both trivial and we don’t cover them in more detail here. We explore
the StopAnalyzer and StandardAnalyzer in more depth because they have non-
Index

• IndexWriter
• Directory
• Analyzer
• Document
• Field

store
Index options: store

Value         Description
:no           Don’t store field.
:yes          Store field in its original format. Use this value if you want to
              highlight matches or print match excerpts a la Google search.
:compressed   Store field in compressed format.

index
Index options: index

Value                     Description
:no                       Do not make this field searchable.
:yes                      Make this field searchable and tokenize its contents.
:untokenized              Make this field searchable but do not tokenize its contents.
                          Use this value for fields you wish to sort by.
:omit_norms               Same as :yes except omit the norms file. The norms file can
                          be omitted if you don’t boost any fields and you don’t need
                          scoring based on field length.
:untokenized_omit_norms   Same as :untokenized except omit the norms file.
Ruby Day Kraków: Full Text Search with Ferret

term_vector
Index options: term vector

Value                     Description
:no                       Don’t store term-vectors.
:yes                      Store term-vectors without storing positions or offsets.
:with_positions           Store term-vectors with positions.
:with_offsets             Store term-vectors with offsets.
:with_positions_offsets   Store term-vectors with positions and offsets.


Search

• IndexSearcher
• Term
• Query
• Hits

Query

• API
• new TermQuery(new Term("name", "Tomek"));

• Lucene QueryParser
• queryParser.parse("name:Tomek");

BooleanQuery
rambo OR ninja

+rambo +ninja -name:rocky
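The semantics of the two queries can be sketched with plain set operations over postings. The document ids and postings below are made up for illustration; a real engine evaluates these clauses against its inverted index rather than Python sets.

```python
# Hypothetical postings: term -> ids of the documents that contain it.
postings = {
    "rambo": {1, 2, 4},
    "ninja": {2, 3, 4},
    "name:rocky": {4, 5},
}

# "rambo OR ninja": a document matches if it contains either term.
either = postings["rambo"] | postings["ninja"]

# "+rambo +ninja -name:rocky": both terms are required, and documents
# whose name field contains "rocky" are excluded.
required = (postings["rambo"] & postings["ninja"]) - postings["name:rocky"]

print(sorted(either))
print(sorted(required))
```

With these made-up postings the OR query matches documents 1 to 4, while the required/prohibited form narrows the result to document 2.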

PhraseQuery
"ninja java" -name:rocky

SloppyPhraseQuery
"red-faced politicians"~3

RangeQuery
releaseDate:[2000 TO 2007]

WildcardQuery
sup?r, su*r, super*
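As a rough illustration of what the three patterns match, Python's standard `fnmatch` module treats `?` as exactly one character and `*` as any run of characters, which is the behaviour assumed here for these wildcards. The term list is hypothetical, and this is not how Lucene evaluates a WildcardQuery internally.

```python
from fnmatch import fnmatchcase

# Hypothetical indexed terms.
terms = ["super", "supper", "suber", "sur", "superman", "sup"]

matches = {
    pattern: [term for term in terms if fnmatchcase(term, pattern)]
    for pattern in ("sup?r", "su*r", "super*")
}

for pattern, matched in matches.items():
    print(pattern, matched)
```

Note that `fnmatch` also interprets `[...]` character classes, so the analogy only holds for patterns built from `?` and `*`.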

FuzzyQuery
color~

colour, collor, colro

http://en.wikipedia.org/wiki/Levenshtein_distance

color → colour: distance 1

colour → coller: distance 2
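The distances above can be checked with the classic dynamic-programming formulation of Levenshtein distance. This is a minimal sketch of the textbook algorithm, not Lucene's FuzzyQuery implementation.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("color", "colour"))
print(levenshtein("colour", "coller"))
```

Plain Levenshtein distance counts a transposition such as color → colro as two edits, because swapping adjacent letters is not a single operation in this formulation.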

Equation 1. Levenshtein Distance Score

This means that an exact match will h… corresponding letters will have a score…
