Information Retrieval with Open Source - Presentation Transcript
Tomasz Korzeniowski
tomek@polarrose.com
Information Retrieval
Retrieval strategies
• Vector Space Model
• Latent Semantic Indexing
• Probabilistic Retrieval Strategies
• Language Models
• Inference Networks
• Extended Boolean Retrieval
• Neural Networks
• Genetic Algorithms
• Fuzzy Set Retrieval
Vector space model
Text retrieval
Analysis
Tokenization
Stop-words
Stemming
Lemmatization
http://tartarus.org/~martin/PorterStemmer/
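The analysis steps listed above (tokenization, stop-word removal, stemming) can be sketched as a small chain. This is only an illustration: the class name AnalysisSketch, the stop-word list, and the suffix rules are assumptions made for the example, and the suffix stripping is a crude stand-in, not the Porter algorithm linked above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain: tokenize -> drop stop-words -> strip a few suffixes.
class AnalysisSketch {
    // Hypothetical, deliberately tiny stop-word list.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "in", "of", "the");

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Crude suffix stripping that keeps at least three characters.
    // This is NOT the Porter stemmer, only a placeholder for one.
    static String stem(String token) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Full chain: tokens that survive the stop-word filter are stemmed.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : tokenize(text)) {
            if (!STOP_WORDS.contains(token)) terms.add(stem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Shipment of gold damaged in a fire"));
    }
}
```

The terms that come out of such a chain, not the raw words, are what end up in the index, so the same chain has to be applied to queries.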
Document
Term
Term frequency
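Inverse document frequency
We want the few documents that contain ferrari to get a higher boost for a query on ferrari than the many documents containing insurance get from a query on insurance.
df_t: document frequency of a term t (the number of documents that contain t), used to scale its weight
N: total number of documents in a corpus
The inverse document frequency (idf) of a term t is defined as follows:
$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$$
The idf of a rare term is high, whereas the idf of a frequent term is likely to be low.
The tf-idf weighting scheme assigns to term t a weight in document d:
$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$$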
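The two weights defined above can be computed directly. A minimal sketch, assuming a base-10 logarithm (the slide only says "log") and raw term counts for tf; the class and method names are made up for the example.

```java
// idf and tf-idf exactly as defined above; the log base (10) is an assumption.
class TfIdfSketch {
    // idf_t = log(N / df_t)
    static double idf(int totalDocs, int docFreq) {
        return Math.log10((double) totalDocs / docFreq);
    }

    // tf-idf_{t,d} = tf_{t,d} * idf_t
    static double tfIdf(int termFreq, int totalDocs, int docFreq) {
        return termFreq * idf(totalDocs, docFreq);
    }

    public static void main(String[] args) {
        int n = 1000; // hypothetical corpus size
        System.out.println("rare term   (df = 10):   idf = " + idf(n, 10));
        System.out.println("common term (df = 1000): idf = " + idf(n, 1000));
        System.out.println("tf = 3, df = 10: tf-idf = " + tfIdf(3, n, 10));
    }
}
```

A term that occurs in every document gets idf 0, so it contributes nothing to a score however often it appears.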
Search
[Figure 7.1: Cosine similarity illustrated, with query vector v(q) and document vector v(d2).]
Q: “gold silver truck”
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
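The query and the three documents above can be ranked with tf-idf weights and cosine similarity. A hedged sketch: whitespace tokenization, raw term counts, a base-10 logarithm, and all class and method names are assumptions made for the example, not something the slides prescribe.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Ranks documents against a query with tf-idf weights and cosine similarity.
class CosineSketch {
    static final List<String> DOCS = List.of(
        "Shipment of gold damaged in a fire",
        "Delivery of silver arrived in a silver truck",
        "Shipment of gold arrived in a truck");

    // Raw term counts after lowercasing and splitting on whitespace.
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    // idf_t = log10(N / df_t), computed over the document collection only.
    static Map<String, Double> inverseDocFreqs(List<String> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            for (String term : termFreqs(doc).keySet()) df.merge(term, 1, Integer::sum);
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log10((double) docs.size() / e.getValue()));
        }
        return idf;
    }

    // tf-idf vector of a text; terms unknown to the collection get weight 0.
    static Map<String, Double> weights(String text, Map<String, Double> idf) {
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs(text).entrySet()) {
            w.put(e.getKey(), e.getValue() * idf.getOrDefault(e.getKey(), 0.0));
        }
        return w;
    }

    // Cosine of the angle between two sparse vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    static double score(String query, String doc, List<String> corpus) {
        Map<String, Double> idf = inverseDocFreqs(corpus);
        return cosine(weights(query, idf), weights(doc, idf));
    }

    public static void main(String[] args) {
        for (int i = 0; i < DOCS.size(); i++) {
            System.out.printf("D%d: %.4f%n", i + 1, score("gold silver truck", DOCS.get(i), DOCS));
        }
    }
}
```

Under these assumptions D2 ranks first, then D3, then D1: "silver" is the rarest query term and D2 contains it twice, while words such as "of", "in", and "a" occur in every document and drop out with idf 0.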
term - 1 (dn,1) (d10,1)
term - 2 (dn,5) (dn,3)
term - 3 (d2,11) (d10,1)
term - 4 (dn,1) (d2,1)
term - 5 (dn,2) (d4,3)
term - n (d6,1) (d7,3)
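The posting lists above pair each term with (document, frequency) entries. A minimal sketch of building such an inverted index in memory, assuming whitespace tokenization and 1-based document ids; the class name and the sample documents are chosen only for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// term -> postings, where each posting maps a document id to a term frequency.
class InvertedIndexSketch {
    static Map<String, Map<Integer, Integer>> build(List<String> docs) {
        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        for (int i = 0; i < docs.size(); i++) {
            int docId = i + 1; // 1-based ids, as in d1, d2, ...
            for (String term : docs.get(i).toLowerCase(Locale.ROOT).split("\\s+")) {
                if (term.isEmpty()) continue;
                // Create the posting list on first sight, then bump the count.
                index.computeIfAbsent(term, k -> new TreeMap<>())
                     .merge(docId, 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "Shipment of gold damaged in a fire",
            "Delivery of silver arrived in a silver truck",
            "Shipment of gold arrived in a truck");
        build(docs).forEach((term, postings) -> System.out.println(term + " -> " + postings));
    }
}
```

Looking up a query term is then a single map access, and the length of its posting list is the document frequency used for idf.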
Lucene
Analysis
Lucene includes several built-in analyzers. The primary ones are shown in table 4.2.
We’ll leave discussion of the two language-specific analyzers, RussianAnalyzer
and GermanAnalyzer, to section 4.8.2 and the special per-field analyzer wrapper,
PerFieldAnalyzerWrapper, to section 4.4.
Table 4.2 Primary analyzers available in Lucene
Analyzer             Steps taken
WhitespaceAnalyzer   Splits tokens at whitespace
SimpleAnalyzer       Divides text at nonletter characters and lowercases
StopAnalyzer         Divides text at nonletter characters, lowercases, and removes stop words
StandardAnalyzer     Tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases; and removes stop words
The built-in analyzers we discuss in this section (WhitespaceAnalyzer, SimpleAnalyzer, StopAnalyzer, and StandardAnalyzer) are designed to work with text in almost any Western (European-based) language. You can see the effect of each of these analyzers in the output in section 4.2.3. WhitespaceAnalyzer and SimpleAnalyzer are both trivial and we don’t cover them in more detail here. We explore the StopAnalyzer and StandardAnalyzer in more depth because they have non-
Index
• IndexWriter
• Directory
• Analyzer
• Document
• Field
store
Index options: store
Value          Description
:no            Don’t store field.
:yes           Store field in its original format. Use this value if you want to highlight matches or print match excerpts a la Google search.
:compressed    Store field in compressed format.
index
Index options: index
Value                      Description
:no                        Do not make this field searchable.
:yes                       Make this field searchable and tokenize its contents.
:untokenized               Make this field searchable but do not tokenize its contents. Use this value for fields you wish to sort by.
:omit_norms                Same as :yes except omit the norms file. The norms file can be omitted if you don’t boost any fields and you don’t need scoring based on field length.
:untokenized_omit_norms    Same as :untokenized except omit the norms file.
Ruby Day Kraków: Full Text Search with Ferret
term_vector
Index options: term vector
Value                      Description
:no                        Don’t store term-vectors.
:yes                       Store term-vectors without storing positions or offsets.
:with_positions            Store term-vectors with positions.
:with_offsets              Store term-vectors with offsets.
:with_positions_offsets    Store term-vectors with positions and offsets.
Search
• IndexSearcher
• Term
• Query
• Hits
Query
• API
• new TermQuery(new Term("name", "Tomek"));
• Lucene QueryParser
• queryParser.parse("name:Tomek");
TermQuery
name:Tomek
BooleanQuery
rambo OR ninja
+rambo +ninja -name:rocky
PhraseQuery
"ninja java" -name:rocky
SloppyPhraseQuery
"red-faced politicians"~3
RangeQuery
releaseDate:[2000 TO 2007]
WildcardQuery
sup?r, su*r, super*
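The meaning of the two wildcard characters can be illustrated by translating a pattern into a regular expression: ? stands for exactly one character and * for any run of characters, including none. This is only a sketch of the matching semantics under that assumption; the class is hypothetical and does not show how Lucene evaluates a WildcardQuery against the index.

```java
import java.util.regex.Pattern;

// Translates a wildcard pattern into an equivalent regular expression.
class WildcardSketch {
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');        // exactly one character
            } else if (c == '*') {
                regex.append(".*");       // zero or more characters
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True if the whole term matches the wildcard pattern.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("sup?r", "super"));
        System.out.println(matches("su*r", "sugar"));
        System.out.println(matches("super*", "superman"));
    }
}
```

A pattern that starts with a literal prefix, such as super*, narrows the candidate terms before any matching is done, while a pattern that starts with a wildcard has no such prefix to narrow by.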
FuzzyQuery
color~
colour, collor, colro
http://en.wikipedia.org/wiki/Levenshtein_distance
color colour - 1
colour coller - 2
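The distances listed above can be reproduced with the standard dynamic-programming formulation of Levenshtein distance (insertions, deletions, and substitutions each cost 1). A minimal sketch; the class name is an assumption, and the way a fuzzy query turns this distance into a score is not shown here.

```java
// Levenshtein (edit) distance using two rolling rows of the DP table.
class LevenshteinSketch {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into b[0..j) takes j insertions.
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i characters of a yields the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1,  // insertion
                             prev[j] + 1),     // deletion
                    prev[j - 1] + cost);       // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println("color  -> colour: " + distance("color", "colour"));
        System.out.println("colour -> coller: " + distance("colour", "coller"));
    }
}
```

Note that a transposition such as color versus colro counts as two edits under plain Levenshtein distance, since swapping adjacent letters is not a primitive operation.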
Equation 1. Levenshtein Distance Score
This means that an exact match will h
corresponding letters will have a score