Tomasz Korzeniowski
   tomek@polarrose.com
Information Retrieval
Retrieval strategies
• Vector Space Model
• Latent Semantic Indexing
• Probabilistic Retrieval Strategies
• Language Model...
Vector space model
Text retrieval
Analysis
Tokenization
Stop-words
Stemming

Lemmatization
http://tartarus.org/~martin/PorterStemmer/
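The analysis steps named above (tokenization, stop-word removal, stemming) can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: the stop-word list is a tiny made-up one, and `crude_stem` is a hypothetical suffix stripper, not the Porter algorithm linked above.

```python
import re

# Assumed, tiny stop-word list for illustration; real analyzers ship longer ones.
STOP_WORDS = {"a", "an", "in", "of", "the"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip one common suffix. A stand-in for a real stemmer, NOT Porter."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Run the full pipeline: tokenize, remove stop words, stem."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


terms = analyze("Shipment of gold damaged in a fire")
```

A production system would swap `crude_stem` for a real stemmer or a lemmatizer; the order of the stages is the point of the sketch.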
Document

 Term
Term frequency
[Slide image: cropped textbook excerpt on document frequency and inverse document frequency (idf).]
$\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t$
Search
[Figure 7.1: Cosine similarity illustrated, with the query vector v(q) and document vectors.]
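Cosine similarity between a query vector and a document vector is their dot product divided by the product of their lengths. The helper below is a minimal sketch; the function name and the choice to return 0.0 for a zero-length vector are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: a vector with no terms matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 1.0], [2.0, 2.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because the score depends only on direction, a long document is not favored merely for repeating terms more often.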
Q: “gold silver truck”

D1: “Shipment of gold damaged in a
fire”

D2: “Delivery of silver arrived in a
silver truck”

D3: “...
TF

    a arrived damaged delivery   fire   gold   in of shipment silver truck

D1 1             1               1      1  ...
$\text{idf}_t = \log \frac{N}{\text{df}_t}$
  •                        • of
area term is high, ...
a arrived damaged delivery    fire   gold   in of shipment silver truck

                0.477            0.477 0.176 0 0  ...
SC(Q,D1) = (0)(0)+(0)(0)+(0)(0.477)+(0)
(0)+(0)(0.477)+(0.176)(0.176)+(0)(0)+(0)
(0)+(0)(0.176)+(0.477)(0)+(0.176)(0)=
(0....
SC(Q,D2) = (0.954)(0.477)+(0.176)(0.176) ≈ 0.486

SC(Q,D3) = (0.176)(0.176)+(0.176)(0.176) ≈ 0.062
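The scores in this worked example can be reproduced with a short script. It is a sketch that assumes whitespace tokenization, base-10 logarithms, and the unnormalized dot product of tf-idf weights used in the SC(Q,D) sums; the variable and function names are chosen here for illustration.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "gold silver truck"


def tokens(text):
    """Lowercase and split on whitespace (no stop-word removal or stemming)."""
    return text.lower().split()


# Term frequencies per document and document frequencies per term.
tf = {name: Counter(tokens(text)) for name, text in docs.items()}
n_docs = len(docs)
df = Counter(term for counts in tf.values() for term in counts)
idf = {term: math.log10(n_docs / df[term]) for term in df}


def weights(counts):
    """tf-idf weight of every term in a bag of words."""
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}


def score(q, d):
    """Dot product of query and document weight vectors."""
    return sum(w * d.get(term, 0.0) for term, w in q.items())


q_weights = weights(Counter(tokens(query)))
scores = {name: score(q_weights, weights(counts)) for name, counts in tf.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

The ranking comes out as D2, D3, D1, matching the hand computation.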
Inverted index
term - 1   (dn,1)    (d10,1)



term - 2   (dn,5)    (dn,3)



term - 3   (d2,11)   (d10,1)



term - 4   (dn,1)    (d2,1)...
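An inverted index maps each term to a postings list of (document, term frequency) pairs, as in the diagram above. The builder below is a minimal in-memory sketch; the document ids and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, term_frequency) pairs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        counts = Counter(docs[doc_id].lower().split())
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return dict(index)


docs = {
    "d1": "Shipment of gold damaged in a fire",
    "d2": "Delivery of silver arrived in a silver truck",
    "d3": "Shipment of gold arrived in a truck",
}
index = build_inverted_index(docs)
```

Looking up a query term is then a single dictionary access, and only the documents in its postings list need to be scored.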
Lucene
Analysis
Lucene includes several built-in analyzers. The primary ones are shown in table 4.2.
We’ll leave discussion of the two lan...
Index
Index

• IndexWriter
• Directory
• Analyzer
• Document
• Field
store
Index options: store
  Value         Description
  :no           Don’t store field
  :yes      ...
index
Index options: index

        Value                                   Description
        :no                       ...
term_vector
Index options: term vector



        Value                                   Description
        :no         ...
Search
Search

• IndexSearcher
• Term
• Query
• Hits
Query
Query

• API
 •   new TermQuery(new Term("name", "Tomek"));

• Lucene QueryParser
 •   queryParser.parse("name:Tomek");
TermQuery
 name:Tomek
BooleanQuery
    rambo OR ninja

+rambo +ninja -name:rocky
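The two queries above can be read as set operations over postings: `OR` clauses union document sets, `+` clauses intersect them, and `-` clauses subtract. The sketch below assumes a hypothetical in-memory `postings` dictionary and is not Lucene's implementation; in particular it ignores scoring.

```python
def boolean_search(postings, must=(), should=(), must_not=()):
    """Evaluate required (+), optional (OR) and prohibited (-) terms."""

    def docs_for(term):
        return postings.get(term, set())

    if must:
        # Every required term has to be present.
        result = set.intersection(*(docs_for(t) for t in must))
    elif should:
        # Without required terms, at least one optional term has to match.
        result = set.union(*(docs_for(t) for t in should))
    else:
        result = set()
    for term in must_not:
        result = result - docs_for(term)
    return result


# Hypothetical postings: term -> set of document ids.
postings = {
    "rambo": {1, 2, 3},
    "ninja": {2, 3, 4},
    "name:rocky": {3},
}

either = boolean_search(postings, should=["rambo", "ninja"])
both_not_rocky = boolean_search(
    postings, must=["rambo", "ninja"], must_not=["name:rocky"]
)
```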
PhraseQuery
"ninja java" -name:rocky
SloppyPhraseQuery
 "red-faced politicians"~3
RangeQuery
releaseDate:[2000 TO 2007]
WildcardQuery
 sup?r, su*r, super*
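In these patterns `?` stands for exactly one character and `*` for any run of characters, including none. Python's `fnmatch` module happens to use the same two wildcards, so it can illustrate which terms each pattern selects. The term list is made up, and this is only a sketch: a real index would walk its term dictionary rather than scan a list.

```python
from fnmatch import fnmatchcase


def wildcard_terms(pattern, terms):
    """Return the terms that match a ?/* wildcard pattern, sorted."""
    return sorted(t for t in terms if fnmatchcase(t, pattern))


# Hypothetical term dictionary.
terms = ["super", "supper", "superb", "sugar", "spur"]

single_char = wildcard_terms("sup?r", terms)
any_middle = wildcard_terms("su*r", terms)
prefix = wildcard_terms("super*", terms)
```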
FuzzyQuery
      color~

 colour, collor, colro
http://en.wikipedia.org/wiki/Levenshtein_distance


                 color colour - 1

                  colour coller - 2
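The distances listed above can be checked with the standard dynamic-programming formulation of Levenshtein distance, which counts single-character insertions, deletions and substitutions. This is a minimal sketch of that textbook algorithm, not the code Lucene uses for fuzzy matching.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


color_colour = levenshtein("color", "colour")
colour_coller = levenshtein("colour", "coller")
```

Note that plain Levenshtein distance treats a transposition such as color to colro as two substitutions.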
Equation 1. Levenshtein Distance Score




This means that an exact match will h
corresponding letters will have a score
Boost
title:Spring^10
Information Retrieval with Open Source
Published in: Technology
Information Retrieval with Open Source

  1. Tomasz Korzeniowski, tomek@polarrose.com
  2. Information Retrieval
  3. Retrieval strategies • Vector Space Model • Latent Semantic Indexing • Probabilistic Retrieval Strategies • Language Models • Inference Networks • Extended Boolean Retrieval • Neural Networks • Genetic Algorithms • Fuzzy Set Retrieval
  4. Vector space model
  5. Text retrieval
  6. Analysis
  7. Tokenization
  8. Stop-words
  9. Stemming, Lemmatization
  10. http://tartarus.org/~martin/PorterStemmer/
  11. Document, Term
  12. Term frequency
  13. Inverse document frequency: with N the total number of documents in the corpus and $\text{df}_t$ the document frequency of term t, the inverse document frequency (idf) of t is $\text{idf}_t = \log \frac{N}{\text{df}_t}$. The idf of a rare term is high, whereas the idf of a ...
  14. The tf-idf weighting scheme assigns to term t a weight in document d: $\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t$.
  15. Search
  16. [Figure 7.1: Cosine similarity illustrated, with the query vector v(q) and the document vector v(d2).]
  17. Q: "gold silver truck"; D1: "Shipment of gold damaged in a fire"; D2: "Delivery of silver arrived in a silver truck"; D3: "Shipment of gold arrived in a truck"
  18. TF

          a  arrived  damaged  delivery  fire  gold  in  of  shipment  silver  truck
      D1  1  0        1        0         1     1     1   1   1         0       0
      D2  1  1        0        1         0     0     1   1   0         2       1
      D3  1  1        0        0         0     1     1   1   1         0       1
      Q   0  0        0        0         0     1     0   0   0         1       1

  19. $\text{idf}_t = \log \frac{N}{\text{df}_t}$ with N = 3; logarithms are to the base 10.
      • a: log 3/3 = 0
      • arrived: log 3/2 = 0.176
      • damaged: log 3/1 = 0.477
      • delivery: log 3/1 = 0.477
      • fire: log 3/1 = 0.477
      • gold: log 3/2 = 0.176
      • in: log 3/3 = 0
      • of: log 3/3 = 0
      • shipment: log 3/2 = 0.176
      • silver: log 3/1 = 0.477
      • truck: log 3/2 = 0.176

  20. tf-idf weights

          a      arrived  damaged  delivery  fire   gold   in     of     shipment  silver  truck
      D1  0      0        0.477    0         0.477  0.176  0      0      0.176     0       0
      D2  0      0.176    0        0.477     0      0      0      0      0         0.954   0.176
      D3  0      0.176    0        0         0      0.176  0      0      0.176     0       0.176
      Q   0      0        0        0         0      0.176  0      0      0         0.477   0.176
  21. SC(Q,D1) = (0)(0)+(0)(0)+(0)(0.477)+(0)(0)+(0)(0.477)+(0.176)(0.176)+(0)(0)+(0)(0)+(0)(0.176)+(0.477)(0)+(0.176)(0) = (0.176)(0.176) ≈ 0.031
  22. SC(Q,D2) = (0.954)(0.477)+(0.176)(0.176) ≈ 0.486; SC(Q,D3) = (0.176)(0.176)+(0.176)(0.176) ≈ 0.062
  23. Inverted index
  24. term-1: (dn,1) (d10,1); term-2: (dn,5) (dn,3); term-3: (d2,11) (d10,1); term-4: (dn,1) (d2,1); term-5: (dn,2) (d4,3); term-n: (d6,1) (d7,3)
  25. Lucene
  26. Analysis
  27. Lucene includes several built-in analyzers. The primary ones are shown in table 4.2. We’ll leave discussion of the two language-specific analyzers, RussianAnalyzer and GermanAnalyzer, to section 4.8.2 and the special per-field analyzer wrapper, PerFieldAnalyzerWrapper, to section 4.4.

      Table 4.2 Primary analyzers available in Lucene

      Analyzer            Steps taken
      WhitespaceAnalyzer  Splits tokens at whitespace
      SimpleAnalyzer      Divides text at nonletter characters and lowercases
      StopAnalyzer        Divides text at nonletter characters, lowercases, and removes stop words
      StandardAnalyzer    Tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases; and removes stop words

      The built-in analyzers we discuss in this section (WhitespaceAnalyzer, SimpleAnalyzer, StopAnalyzer, and StandardAnalyzer) are designed to work with text in almost any Western (European-based) language. You can see the effect of each of these analyzers in the output in section 4.2.3. WhitespaceAnalyzer and SimpleAnalyzer are both trivial and we don’t cover them in more detail here. We explore the StopAnalyzer and StandardAnalyzer in more depth because they have non-
  28. Index
  29. Index • IndexWriter • Directory • Analyzer • Document • Field
  30. store. Index options: store

      Value        Description
      :no          Don’t store field.
      :yes         Store field in its original format. Use this value if you want to highlight matches or print match excerpts a la Google search.
      :compressed  Store field in compressed format.

  31. index. Index options: index

      Value                    Description
      :no                      Do not make this field searchable.
      :yes                     Make this field searchable and tokenize its contents.
      :untokenized             Make this field searchable but do not tokenize its contents. Use this value for fields you wish to sort by.
      :omit_norms              Same as :yes except omit the norms file. The norms file can be omitted if you don’t boost any fields and you don’t need scoring based on field length.
      :untokenized_omit_norms  Same as :untokenized except omit the norms file.

      Ruby Day Kraków: Full Text Search with Ferret

  32. term_vector. Index options: term vector

      Value                    Description
      :no                      Don’t store term-vectors.
      :yes                     Store term-vectors without storing positions or offsets.
      :with_positions          Store term-vectors with positions.
      :with_offsets            Store term-vectors with offsets.
      :with_positions_offsets  Store term-vectors with positions and offsets.
  33. Search
  34. Search • IndexSearcher • Term • Query • Hits
  35. Query
  36. Query • API: new TermQuery(new Term("name", "Tomek")); • Lucene QueryParser: queryParser.parse("name:Tomek");
  37. TermQuery: name:Tomek
  38. BooleanQuery: rambo OR ninja; +rambo +ninja -name:rocky
  39. PhraseQuery: "ninja java" -name:rocky
  40. SloppyPhraseQuery: "red-faced politicians"~3
  41. RangeQuery: releaseDate:[2000 TO 2007]
  42. WildcardQuery: sup?r, su*r, super*
  43. FuzzyQuery: color~ (colour, collor, colro)
  44. http://en.wikipedia.org/wiki/Levenshtein_distance (color, colour: 1; colour, coller: 2)
  45. Equation 1. Levenshtein Distance Score. This means that an exact match will h corresponding letters will have a score
  46. Boost: title:Spring^10