# Information Retrieval with Open Source

Presentation by Tomasz Korzeniowski from the Oredev conference in Malmoe, 2007.

### Information Retrieval with Open Source

1. 1. Tomasz Korzeniowski tomek@polarrose.com
2. 2. Information Retrieval
3. 3. Retrieval strategies • Vector Space Model • Latent Semantic Indexing • Probabilistic Retrieval Strategies • Language Models • Inference Networks • Extended Boolean Retrieval • Neural Networks • Genetic Algorithms • Fuzzy Set Retrieval
4. 4. Vector space model
5. 5. Text retrieval
6. 6. Analysis
7. 7. Tokenization
8. 8. Stop-words
9. 9. Stemming, lemmatization
10. 10. http://tartarus.org/~martin/PorterStemmer/
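The analysis steps named on the slides above (tokenization, then stop-word removal) can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the talk: the `STOP_WORDS` set and the `tokenize` and `analyze` helpers are hypothetical names chosen for this example, and stemming is deliberately left out because a real Porter stemmer is considerably more involved.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "in", "of", "the"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def analyze(text, stop_words=STOP_WORDS):
    """Tokenize, then drop stop-words. No stemming is applied here."""
    return [token for token in tokenize(text) if token not in stop_words]


print(analyze("Shipment of gold damaged in a fire"))
# ['shipment', 'gold', 'damaged', 'fire']
```

A stemming or lemmatization step would slot in as a third stage that maps each surviving token to its root form.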
11. 11. Document Term
12. 12. Term frequency
13. 13. Inverse document frequency: with $N$ the total number of documents in a corpus and $\mathrm{df}_t$ the document frequency of a term $t$, the inverse document frequency (idf) of $t$ is defined as $\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$. The idf of a rare term is high, whereas the idf of a frequent term is low.
14. 14. The tf-idf weighting scheme assigns to term $t$ a weight in document $d$ given by $\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$.
15. 15. Search
16. 16. Vector space: cosine similarity illustrated (Figure 7.1), with the query vector v(q) and document vectors such as v(d2).
17. 17. Q: “gold silver truck” D1: “Shipment of gold damaged in a fire” D2: “Delivery of silver arrived in a silver truck” D3: “Shipment of gold arrived in a truck”
18. 18. TF

    |    | a | arrived | damaged | delivery | fire | gold | in | of | shipment | silver | truck |
    |----|---|---------|---------|----------|------|------|----|----|----------|--------|-------|
    | D1 | 1 | 0       | 1       | 0        | 1    | 1    | 1  | 1  | 1        | 0      | 0     |
    | D2 | 1 | 1       | 0       | 1        | 0    | 0    | 1  | 1  | 0        | 2      | 1     |
    | D3 | 1 | 1       | 0       | 0        | 0    | 1    | 1  | 1  | 1        | 0      | 1     |
    | Q  | 0 | 0       | 0       | 0        | 0    | 1    | 0  | 0  | 0        | 1      | 1     |

19. 19. $\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$ with $N = 3$; logarithms are to the base 10.

    | Term     | idf               |
    |----------|-------------------|
    | a        | log 3/3 = 0       |
    | arrived  | log 3/2 = 0.176   |
    | damaged  | log 3/1 = 0.477   |
    | delivery | log 3/1 = 0.477   |
    | fire     | log 3/1 = 0.477   |
    | gold     | log 3/2 = 0.176   |
    | in       | log 3/3 = 0       |
    | of       | log 3/3 = 0       |
    | shipment | log 3/2 = 0.176   |
    | silver   | log 3/1 = 0.477   |
    | truck    | log 3/2 = 0.176   |

20. 20. tf-idf weights

    |    | a | arrived | damaged | delivery | fire  | gold  | in | of | shipment | silver | truck |
    |----|---|---------|---------|----------|-------|-------|----|----|----------|--------|-------|
    | D1 | 0 | 0       | 0.477   | 0        | 0.477 | 0.176 | 0  | 0  | 0.176    | 0      | 0     |
    | D2 | 0 | 0.176   | 0       | 0.477    | 0     | 0     | 0  | 0  | 0        | 0.954  | 0.176 |
    | D3 | 0 | 0.176   | 0       | 0        | 0     | 0.176 | 0  | 0  | 0.176    | 0      | 0.176 |
    | Q  | 0 | 0       | 0       | 0        | 0     | 0.176 | 0  | 0  | 0        | 0.477  | 0.176 |
21. 21. SC(Q,D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) + (0)(0.176) + (0.477)(0) + (0.176)(0) = (0.176)(0.176) ≈ 0.031
22. 22. SC(Q,D2) = (0.954)(0.477) + (0.176)(0.176) ≈ 0.486; SC(Q,D3) = (0.176)(0.176) + (0.176)(0.176) ≈ 0.062
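The worked example on the preceding slides (term frequencies, base-10 idf, tf-idf weights, and the similarity coefficient SC as a plain dot product) can be reproduced with a short script. This is a sketch under simple assumptions: tokens are obtained by lowercasing and splitting on whitespace, and the helper names `term_counts`, `weights`, and `score` are made up for this illustration.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "gold silver truck"


def term_counts(text):
    """Term frequencies of a text, using naive whitespace tokenization."""
    return Counter(text.lower().split())


tf = {name: term_counts(text) for name, text in docs.items()}
n_docs = len(docs)
vocabulary = sorted(set().union(*tf.values()))

# Document frequency: in how many documents each term occurs.
df = {t: sum(1 for counts in tf.values() if t in counts) for t in vocabulary}
# Inverse document frequency with base-10 logarithms, as on the slides.
idf = {t: math.log10(n_docs / df[t]) for t in vocabulary}


def weights(counts):
    """tf-idf weight of every vocabulary term for one document or query."""
    return {t: counts[t] * idf[t] for t in vocabulary}


def score(q_weights, d_weights):
    """Similarity coefficient: dot product of the two weight vectors."""
    return sum(q_weights[t] * d_weights[t] for t in vocabulary)


q_w = weights(term_counts(query))
scores = {name: score(q_w, weights(counts)) for name, counts in tf.items()}

for name, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(s, 3))
# D2 0.486
# D3 0.062
# D1 0.031
```

The ranking D2, D3, D1 matches the slides. Note that this score is not length-normalized; dividing by the vector norms would turn it into the cosine similarity.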
23. 23. Inverted index
24. 24. Postings lists:

    - term-1: (dn, 1), (d10, 1)
    - term-2: (dn, 5), (dn, 3)
    - term-3: (d2, 11), (d10, 1)
    - term-4: (dn, 1), (d2, 1)
    - term-5: (dn, 2), (d4, 3)
    - term-n: (d6, 1), (d7, 3)
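One way to read the postings above is as a map from each term to a list of (document, term frequency) pairs; that reading is an assumption, since the slide does not label the pairs. Under it, a minimal in-memory inverted index over the three example documents might look like this (`build_inverted_index` is a hypothetical helper, and tokenization is plain whitespace splitting):

```python
from collections import Counter, defaultdict

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}


def build_inverted_index(documents):
    """Map each term to a postings list of (doc_id, term_frequency) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term].append((doc_id, freq))
    return dict(index)


index = build_inverted_index(docs)
print(index["silver"])  # [('D2', 2)]
print(index["gold"])    # [('D1', 1), ('D3', 1)]
```

Answering a query then only requires visiting the postings of the query terms instead of scanning every document.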
25. 25. Lucene
26. 26. Analysis
27. 27. Lucene includes several built-in analyzers. The primary ones are shown in table 4.2. We’ll leave discussion of the two language-specific analyzers, RussianAnalyzer and GermanAnalyzer, to section 4.8.2 and the special per-field analyzer wrapper, PerFieldAnalyzerWrapper, to section 4.4.

    Table 4.2 Primary analyzers available in Lucene

    | Analyzer           | Steps taken |
    |--------------------|-------------|
    | WhitespaceAnalyzer | Splits tokens at whitespace |
    | SimpleAnalyzer     | Divides text at nonletter characters and lowercases |
    | StopAnalyzer       | Divides text at nonletter characters, lowercases, and removes stop words |
    | StandardAnalyzer   | Tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases; and removes stop words |

    The built-in analyzers we discuss in this section (WhitespaceAnalyzer, SimpleAnalyzer, StopAnalyzer, and StandardAnalyzer) are designed to work with text in almost any Western (European-based) language. You can see the effect of each of these analyzers in the output in section 4.2.3. WhitespaceAnalyzer and SimpleAnalyzer are both trivial and we don’t cover them in more detail here. We explore the StopAnalyzer and StandardAnalyzer in more depth because they have non-
28. 28. Index
29. 29. Index • IndexWriter • Directory • Analyzer • Document • Field
30. 30. Index options: store

    | Value       | Description |
    |-------------|-------------|
    | :no         | Don’t store field. |
    | :yes        | Store field in its original format. Use this value if you want to highlight matches or print match excerpts a la Google search. |
    | :compressed | Store field in compressed format. |

31. 31. Index options: index

    | Value                   | Description |
    |-------------------------|-------------|
    | :no                     | Do not make this field searchable. |
    | :yes                    | Make this field searchable and tokenize its contents. |
    | :untokenized            | Make this field searchable but do not tokenize its contents. Use this value for fields you wish to sort by. |
    | :omit_norms             | Same as :yes except omit the norms file. The norms file can be omitted if you don’t boost any fields and you don’t need scoring based on field length. |
    | :untokenized_omit_norms | Same as :untokenized except omit the norms file. |

    Ruby Day Kraków: Full Text Search with Ferret

32. 32. Index options: term_vector

    | Value                   | Description |
    |-------------------------|-------------|
    | :no                     | Don’t store term-vectors. |
    | :yes                    | Store term-vectors without storing positions or offsets. |
    | :with_positions         | Store term-vectors with positions. |
    | :with_offsets           | Store term-vectors with offsets. |
    | :with_positions_offsets | Store term-vectors with positions and offsets. |
33. 33. Search
34. 34. Search • IndexSearcher • Term • Query • Hits
35. 35. Query
36. 36. Query • API • new TermQuery(new Term("name", "Tomek")); • Lucene QueryParser • queryParser.parse("name:Tomek");
37. 37. TermQuery name:Tomek
38. 38. BooleanQuery rambo OR ninja; +rambo +ninja -name:rocky
39. 39. PhraseQuery "ninja java" -name:rocky
40. 40. SloppyPhraseQuery "red-faced politicians"~3
41. 41. RangeQuery releaseDate:[2000 TO 2007]
42. 42. WildcardQuery sup?r, su*r, super*
43. 43. FuzzyQuery color~ colour, collor, colro
44. 44. http://en.wikipedia.org/wiki/Levenshtein_distance • color → colour: distance 1 • colour → coller: distance 2
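The two distances on the slide above can be checked with the standard dynamic-programming formulation of Levenshtein distance (insertions, deletions, and substitutions each cost 1). This is a generic textbook sketch, not the code Lucene or Ferret uses for FuzzyQuery; the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("color", "colour"))   # 1
print(levenshtein("colour", "coller"))  # 2
```

Keeping only two rows of the table at a time holds memory use proportional to the length of the second string.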
45. 45. Equation 1. Levenshtein Distance Score. This means that an exact match will h… corresponding letters will have a score…
46. 46. Boost title:Spring^10