Information Access with     Apache Lucene                              Pierpaolo Basile, PhD     Research Assistant Univer...
Outline• Search Engine  – Vector Space Model  – Inverted Index• Apache Lucene  – Indexing  – Searching• Advanced topics  –...
SEARCH ENGINE
Searching            Collection of             documents            User query             Relevant            documents
Information RetrievalInformation Retrieval (IR) is the area of studyconcerned with searching for documents,for information...
Information Retrieval Process            User feedback        User                               Interface                ...
Information Retrieval Model<D, Q, F, R(qi, dj)>• D: document representation• Q: query representation• F: query/document re...
Bag-of-words representationDocument/query as unordered collection of wordsJohn likes to watch movies. Mary likes too. John...
Vector Space ModelQuery/Document Framework• Algebraic model• Documents/queries are represented as vector  in a geometric s...
Vector Space Model     d j  ( w1, j , w2, j ,, wt , j )     q  ( w1,q , w2,q ,, wt ,q )Each vector component correspon...
Vector Space Modeldj=(hit:5, refresh:2)q1=(refresh:1)q2=(hit:1)      refresh                                      dj      ...
Vector Space ModelRanking function• Vector similarity between query and  document computed by cosine                      ...
Vector Space Modeldj=(hit:5, refresh:2)q1=(refresh:1)q2=(hit:1)      refresh                                      dj      ...
Term-Document matrix • ‘It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.’ • ‘It’s the oldes...
Term-Document matrix                 D1      D2      D3Cheshire Cat     1       0       1Alice            2       2       ...
Term-Document matrix                  D1        D2   D3Cheshire Cat      1         0    1Alice             2         2    ...
Inverted Index
Inverted IndexDictionaryAlice        D1:2 D2:2 D3:2…book         D1:1…Cheshire Cat D1:1 D3:1…grin         D3:2…King       ...
Inverted IndexDictionary     Posting ListAlice        D1:2 D2:2 D3:2…book         D1:1…Cheshire Cat D1:1 D3:1…grin        ...
Inverted IndexDictionaryAlice        D1:2 D2:2 D3:2…book         D1:1…Cheshire Cat D1:1 D3:1…grin         D3:2…King       ...
Inverted IndexDictionaryAlice        D1:2 D2:2 D3:2…book         D1:1…Cheshire Cat D1:1 D3:1…grin         D3:2…King       ...
Inverted IndexDictionaryAlice        D1:2 D2:2 D3:2…book         D1:1…Cheshire Cat D1:1 D3:1…grin         D3:2…King       ...
Inverted IndexDictionaryAlice        D1:2 D2:2 D3:2…book         D1:1…Cheshire Cat D1:1 D3:1…grin         D3:2…King       ...
Inverted IndexDictionaryAlice        D1:2 D2:2 D3:2…book         D1:1…Cheshire Cat D1:1 D3:1…grin         D3:2…King       ...
Inverted IndexDictionaryAlice        D1:2 D2:2 D3:2…book         D1:1…Cheshire Cat D1:1 D3:1…grin         D3:2…King       ...
Inverted IndexDictionaryAlice        D1:2 D2:2 D3:2…book         D1:1…Cheshire Cat D1:1 D3:1…grin         D3:2…King       ...
Inverted IndexDictionary     Posting List                 Position ListAlice        D1:2 D2:2 D3:2   14, 49        1, 40  ...
Inverted Index: query processingDictionaryAlice        D1:2 D2:2 D3:2…book         D1:1…                              AND ...
Inverted Index: query processingDictionaryAlice        D1:2 D2:2 D3:2…book         D1:1…                              OR (...
Inverted Index: query processingDictionaryAlice        D1:2 D2:2 D3:2…book         D1:1…                              NOT ...
Term-weight• Measures the term relevance in a document  – component value in the document representation  – TF*IDF     • T...
Term-weight• Measures the term relevance in a document  – component value in the document representation  – TF*IDF     • T...
TF*IDF insights• increases proportionally to the term  frequency in the document• decreases to the number of documents in ...
Inverted index/TF*IDF• TF: computed by term occurrences in the  posting list• IDF: computed by the posting list cardinalit...
Inverted index/TF*IDF• TF: computed by term occurrences in the  posting list• IDF: computed by the posting list cardinalit...
http://lucene.apache.orgAPACHE LUCENE
Apache Lucene• Apache Software Foundation project  – http://lucene.apache.org• What Lucene is  – Search library: indexing ...
Lucene•   Full-text search•   High performance•   Scalable•   Cross-platform•   100%-pure Java•   Porting in other languag...
Information Retrieval Process            User feedback        User                               Interface                ...
Information Retrieval Process             User feedback           User                                   Interface        ...
Lucene Model 1/2• Based on Vector Space Model• Features:  –   multi-field document  –   tf–term frequency: number of match...
Lucene Model 1/2• coor(q,d) score factor based on how many of the query terms are  found in the specified document• queryN...
Lucene         TopDocs
Document IndexingFSDirectory dir = FSDirectory.open(new File(“/home/user/index_test));Analyzer analyzer = new StandardAnal...
Field Options• Field.Store  – NO: field value is not stored  – YES: field value is stored• Field.Index  – ANALYZED: field ...
Luke• Lucene diagnostic tool: http://code.google.com/p/luke/
Document: Fields Representationhttp://www.bbc.co.uk/news/technology-15365207
Document: Fields Representationhttp://www.bbc.co.uk/news/technology-15365207                    URL: Index.Store.YES,     ...
Text Operations• Tokenization  – split text in token• Stop word elimination  – remove common words and closed word class  ...
Analyzers• KeywordAnalyzer   – Returns a single token• WhitespaceAnalyzer   – Splits tokens at whitespace• Simple Analyzer...
Analyzers    The quick brown fox jumped over the lazy dogs•   WhitespaceAnalyzer:    [The] [quick] [brown] [fox] [jumped] ...
Analyzers  "XY&Z Corporation - xyz@example.com"• WhitespaceAnalyzer:  [XY&Z] [Corporation] [-] [xyz@example.com]• SimpleAn...
Tokenizers• A Tokenizer is a TokenStream whose input is a Reader• Tokenizers break field text into tokens• StandardTokeniz...
TokenFilters• A TokenFilter is a TokenStream whose input is  another token stream• LowerCaseFilter• StopFilter• LengthFilt...
Analyzerpublic class MyAnalyzer2 extends Analyzer {private Set sw = StopFilter.makeStopSet(Version.LUCENE_36, new    Array...
TermVectors• Store information about term frequency,  positions and offsets• Relevance Feedback and “More Like This”• Clus...
Lucene TermVector• TermFreqVector is a representation of all of the  terms and term counts in a specific Field of a  Docum...
Lucene TermPositionVector• TermPositionVector is a representation of all  of the terms, term counts, term position and ter...
Store TermVector• During indexing, create a Field that stores Term• Vectors:   – new     Field(“text",text,Field.Store.YES...
Term Vector: Positions and Offsets       Positions   0           1            2           3              4              5 ...
Example• Main method with three parameters   – Input_dir: directory to index   – Index_dir: directory containing the index...
Lucene Search• IndexReader: interface for accessing an index• QueryParser: parses the user query• IndexSearcher: implement...
Lucene SearchDirectory dir = FSDirectory.open(new File(args[0]));IndexReader reader = IndexReader.open(dir);IndexSearcher ...
Lucene QueryParser• Example:  – queryParser.parse(“name:SpiderMan”);• good human entered queries, debugging• does text ana...
QUERY SYNTAX 1/2• title:"The Right Way" AND text:go    – Phrase query + term query• Wildcard Searches    – tes* (test - te...
QUERY SYNTAX 2/2• Boosting a Term    – jakarta^4 apache    – "jakarta apache"^4 "Apache Lucene”• Boolean Operator    – NOT...
QUERY (BY CODE) 1/2TermQuery query =new TermQuery(new Term(“name”,”Spider-Man”))• explicit, no escaping necessary• does no...
QUERY (BY CODE) 2/2TermQuery tq1 = new TermQuery(new Term(“name”,”pippo”))TermQuery tq2 = new TermQuery(new Term(“name”,”p...
DELETING DOCUMENTSIndexWriter  – deleteDocuments(Term t)  – deleteDocuments(Term… t)  – deleteDocuments(Query t)  – delete...
Relevance feedbackDocuments similarityApache TikaADVANCED TOPICS
Relevance feedback• Improve retrieval performance using  information about document relevance  – Explicit: relevance indic...
Relevance feedbackUser query                                                  New                                         ...
Rocchio Algorithm                  1                1             Qnew    Q           Dj            Dk   ...
Rocchio Algorithm (blind)                  1                1             Qnew    Q           Dj          ...
Relevance feedback   Q = Alice                                   Qnew = Alice Cheshire Cat King                           ...
Relevance feedback in Lucene• Possible using TermVector  – Retrieve the top K documents  – Build the Qnew using TermVector...
Document similarity• Documents are represented by vectors  – Build the document vector using TermVector  – Compute the sim...
Apache Tika• Extract metadata and text from several file  formats  – XML, HTML, OpenOffice, Microsoft Office, PDF  – MBOX ...
Extract metadata and textTika tika = new Tika();String type = tika.detect(new File(args[0]));System.out.println("File type...
Apache Tika + LuceneFiles/URLs              metadata  Tika                   Lucene                text
CONCLUSIONS
Conclusions• A popular IR model  – Vector Space Model• Lucene  – provides API to build search engine• Apace Tika  – extrac...
Lucene Contrib• analyzers: new analyzer  – Arabic, Chinese, …• highlighter: a set of classes for highlighting  matching te...
Lucene related project• Apache Solr: open source enterprise search  platform from the Apache Lucene  – Full-Text Search Ca...
Thank you for your attention!         Questions?
Upcoming SlideShare
Loading in …5
×

[F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene

2,873 views

Published on

Published in: Education, Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
2,873
On SlideShare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
23
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

[F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene

  1. 1. Information Access with Apache Lucene Pierpaolo Basile, PhD Research Assistant Univeristy Of Bari Aldo Moro co-founder QuestionCube s.r.l. pierpaolo.basile@{uniba.it, questioncube.com} CODE CAMP: 20 LUGLIO - 1 AGOSTO, 2012 S. VITO DEI NORMANNI (BR)
  2. 2. Outline• Search Engine – Vector Space Model – Inverted Index• Apache Lucene – Indexing – Searching• Advanced topics – Relevance feedback – Document similarity – Apache Tika
  3. 3. SEARCH ENGINE
  4. 4. Searching Collection of documents User query Relevant documents
  5. 5. Information RetrievalInformation Retrieval (IR) is the area of studyconcerned with searching for documents,for information within documents, andfor metadata about documents, as well as that ofsearching structured storage, relational databases,and the World Wide Web. Wikipedia
  6. 6. Information Retrieval Process User feedback User Interface Query Docs Doc Text Doc Text Operations Doc Rankeddocuments Logic view Query Indexing Operations Query Inverted Index Searching Index Retrived documents Ranking
  7. 7. Information Retrieval Model<D, Q, F, R(qi, dj)>• D: document representation• Q: query representation• F: query/document representation framework• R(qi, dj): ranking function
  8. 8. Bag-of-words representationDocument/query as unordered collection of wordsJohn likes to watch movies. Mary likes too. John also likes towatch football games. {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}
  9. 9. Vector Space ModelQuery/Document Framework• Algebraic model• Documents/queries are represented as vector in a geometric space• Terms are vector components
  10. 10. Vector Space Model d j  ( w1, j , w2, j ,, wt , j ) q  ( w1,q , w2,q ,, wt ,q )Each vector component corresponds to a termThe component value wi,j is the weight of the term tiin the document dj
  11. 11. Vector Space Modeldj=(hit:5, refresh:2)q1=(refresh:1)q2=(hit:1) refresh dj q1 q2 hit
  12. 12. Vector Space ModelRanking function• Vector similarity between query and document computed by cosine d j q R(d j , q)  cos   dj  q
  13. 13. Vector Space Modeldj=(hit:5, refresh:2)q1=(refresh:1)q2=(hit:1) refresh dj q1 θ1 θ2 q2 hit
  14. 14. Term-Document matrix • ‘It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.’ • ‘It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice. • Alice watched the White King as he slowly struggled up from bar to bar, till at last she said, ‘Why, you’ll be hours and hours getting to the table, at that rate. • Alice looked round eagerly, and found that it was the Red Queen. ‘She’s grown a good deal!’ was her first remark. • In the pool of light was a billiards table, with two figures moving around it. Alice walked toward them, and as she approached they turned to look at her. • Alice lay back, and closed her eyes. There was the Red Queen again, with that incessant grin. Or was it the Cheshire cats grin?
  15. 15. Term-Document matrix D1 D2 D3Cheshire Cat 1 0 1Alice 2 2 2book 1 0 0King 1 1 0table 0 1 1Queen 0 1 1grin 0 0 2
  16. 16. Term-Document matrix D1 D2 D3Cheshire Cat 1 0 1Alice 2 2 2book 1 0 0King 1 1 0table 0 1 1Queen 0 1 1grin 0 0 2 0Query: 1 0Alice AND Queen 0 0 1 0
  17. 17. Inverted Index
  18. 18. Inverted IndexDictionaryAlice D1:2 D2:2 D3:2…book D1:1…Cheshire Cat D1:1 D3:1…grin D3:2…King D1:1 D2:1…Queen D2:1 D3:1…table D2:1 D3:1…
  19. 19. Inverted IndexDictionary Posting ListAlice D1:2 D2:2 D3:2…book D1:1…Cheshire Cat D1:1 D3:1…grin D3:2…King D1:1 D2:1…Queen D2:1 D3:1…table D2:1 D3:1…
  20. 20. Inverted IndexDictionaryAlice D1:2 D2:2 D3:2…book D1:1…Cheshire Cat D1:1 D3:1…grin D3:2…King D1:1 D2:1…Queen D2:1 D3:1…table D2:1 D3:1…
  21. 21. Inverted IndexDictionaryAlice D1:2 D2:2 D3:2…book D1:1…Cheshire Cat D1:1 D3:1…grin D3:2…King D1:1 D2:1…Queen D2:1 D3:1…table D2:1 D3:1…
  22. 22. Inverted IndexDictionaryAlice D1:2 D2:2 D3:2…book D1:1…Cheshire Cat D1:1 D3:1…grin D3:2…King D1:1 D2:1…Queen D2:1 D3:1…table D2:1 D3:1…
  23. 23. Inverted IndexDictionaryAlice D1:2 D2:2 D3:2…book D1:1…Cheshire Cat D1:1 D3:1…grin D3:2…King D1:1 D2:1…Queen D2:1 D3:1…table D2:1 D3:1…
  24. 24. Inverted IndexDictionaryAlice D1:2 D2:2 D3:2…book D1:1…Cheshire Cat D1:1 D3:1…grin D3:2…King D1:1 D2:1…Queen D2:1 D3:1…table D2:1 D3:1…
  25. 25. Inverted IndexDictionaryAlice D1:2 D2:2 D3:2…book D1:1…Cheshire Cat D1:1 D3:1…grin D3:2…King D1:1 D2:1…Queen D2:1 D3:1…table D2:1 D3:1…
  26. 26. Inverted IndexDictionaryAlice D1:2 D2:2 D3:2…book D1:1…Cheshire Cat D1:1 D3:1…grin D3:2…King D1:1 D2:1…Queen D2:1 D3:1…table D2:1 D3:1…
  27. 27. Inverted IndexDictionary Posting List Position ListAlice D1:2 D2:2 D3:2 14, 49 1, 40 18, 34…book D1:1 33…Cheshire Cat D1:1 D3:1…grin D3:2… Term positions in D1King D1:1 D2:1…Queen D2:1 D3:1…table D2:1 D3:1…
  28. 28. Inverted Index: query processingDictionaryAlice D1:2 D2:2 D3:2…book D1:1… AND (intersection )Cheshire Cat D1:1 D3:1… D1:2 D2:2 D3:2grin D3:2 <Alice><King>… D1:1 D2:1King D1:1 D2:1…Queen D2:1 D3:1 Alice AND King -> (D1, D2)…table D2:1 D3:1…
  29. 29. Inverted Index: query processingDictionaryAlice D1:2 D2:2 D3:2…book D1:1… OR (union )Cheshire Cat D1:1 D3:1… D3:2grin D3:2 <grin>  <King>… D1:1 D2:1King D1:1 D2:1…Queen D2:1 D3:1 Alice AND King -> (D1, D2, D3)…table D2:1 D3:1…
  30. 30. Inverted Index: query processingDictionaryAlice D1:2 D2:2 D3:2…book D1:1… NOT (complement )Cheshire Cat D1:1 D3:1… D1:2 D2:2 D3:2grin D3:2 <Alice> <grin>… D3:2King D1:1 D2:1…Queen D2:1 D3:1 Alice NOT grin -> (D1, D2)…table D2:1 D3:1…
  31. 31. Term-weight• Measures the term relevance in a document – component value in the document representation – TF*IDF • TF (term frequency): term occurrences in the document • IDF (inverse document frequency): inverse to the number of documents in which the term occurs D tf * idf (t , d )  tf (t , d ) * log {d  D : t  d }
  32. 32. Term-weight• Measures the term relevance in a document – component value in the document representation – TF*IDF • TF (term frequency): term occurrences in the document • IDF (inverse document frequency): inverse to the number of documents in which the term occurs Number of documents in D the collection tf * idf (t , d )  tf (t , d ) * log {d  D : t  d } idf
  33. 33. TF*IDF insights• increases proportionally to the term frequency in the document• decreases to the number of documents in which the term belongs – common words are generally more frequent in the collection• IDF depends on the collection, TF on the document
  34. 34. Inverted index/TF*IDF• TF: computed by term occurrences in the posting list• IDF: computed by the posting list cardinality Alice D1:2 D2:2 D3:2 |D|=100 TF 100 tf * idf ( Alice , D1)  2 * log 3
  35. 35. Inverted index/TF*IDF• TF: computed by term occurrences in the posting list• IDF: computed by the posting list cardinality Alice D1:2 D2:2 D3:2 |D|=100 IDF 100 tf * idf ( Alice , D1)  2 * log 3
  36. 36. http://lucene.apache.orgAPACHE LUCENE
  37. 37. Apache Lucene• Apache Software Foundation project – http://lucene.apache.org• What Lucene is – Search library: indexing and searching Application Programming Interface (API)• What Lucene is not – Search engine (no crawling, server, user interface, etc.)
  38. 38. Lucene• Full-text search• High performance• Scalable• Cross-platform• 100%-pure Java• Porting in other languages: .NET, Python
  39. 39. Information Retrieval Process User feedback User Interface Query Text Doc Text Operations Rankeddocuments Logic view Query Indexing Operations Query Inverted Index Searching Index Retrived documents Ranking
  40. 40. Information Retrieval Process User feedback User Interface org.apache.lucene Query Text Document Analyzer/Tokenizer/Filter Rankeddocuments Logic view QueryParser IndexWriter Query Inverted Index IndexReader IndexSearcher Index Retrived documents Similarity
  41. 41. Lucene Model 1/2• Based on Vector Space Model• Features: – multi-field document – tf–term frequency: number of matching terms in field – lengthNorm: number of tokens in field – idf: inverse document frequency – coord: coordination factor, number of matching• Terms – field boost – query clause boost
  42. 42. Lucene Model 1/2• coor(q,d) score factor based on how many of the query terms are found in the specified document• queryNorm(q,d) is a normalizing factor used to make scores between queries comparable• boost(t) term boost• norm(t,d) encapsulates a few (indexing time) boost and length factors
  43. 43. Lucene TopDocs
  44. 44. Document IndexingFSDirectory dir = FSDirectory.open(new File(“/home/user/index_test));Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);IndexWriter writer = new IndexWriter(dir, config);Document doc = new Document();doc.add(new Field("super_name", "Sandman", Field.Store.YES,Field.Index.ANALYZED));doc.add(new Field("name", "William Baker", Field.Store.YES, Field.Index.ANALYZED));doc.add(new Field("category", "superhero", Field.Store.YES, Field.Index.ANALYZED));writer.addDocument(doc);writer.close();
  45. 45. Field Options• Field.Store – NO: field value is not stored – YES: field value is stored• Field.Index – ANALYZED: field is analyzed by the analyzer – NO: not indexed – NOT_ANALYZED: indexed but not analyzed
  46. 46. Luke• Lucene diagnostic tool: http://code.google.com/p/luke/
  47. 47. Document: Fields Representationhttp://www.bbc.co.uk/news/technology-15365207
  48. 48. Document: Fields Representationhttp://www.bbc.co.uk/news/technology-15365207 URL: Index.Store.YES, Data: Index.Store.YES, Field.Index.NO Field.Index.NOT_ANALYZED Authors: Index.Store.YES, Field.Index.ANALYZED Title: Index.Store.NO, Field.Index.ANALYZED Abstract: Index.Store.NO, Field.Index.ANALYZED Content: Index.Store.NO, Field.Index.ANALYZED Comments: Index.Store.NO, Field.Index.ANALYZED
  49. 49. Text Operations• Tokenization – split text in token• Stop word elimination – remove common words and closed word class (e.g. function words)• Stemming – reducing inflected (or sometimes derived) words to their stem
  50. 50. Analyzers• KeywordAnalyzer – Returns a single token• WhitespaceAnalyzer – Splits tokens at whitespace• Simple Analyzer – Divides text at nonletter characters and lowercases• StopAnalyzer – Divides text at non letter characters, lowercases, and removes stop words• StandardAnalyzer – Tokenizes based on sophisticated grammar that recognizes e- mail addresses, acronyms, etc.; lowercases and removes stop words (optional)
  51. 51. Analyzers The quick brown fox jumped over the lazy dogs• WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]• SimpleAnalyzer: [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]• StopAnalyzer: [quick] [brown] [fox] [jumped] [lazy] [dogs]• StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
  52. 52. Analyzers "XY&Z Corporation - xyz@example.com"• WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com]• SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]• StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
  53. 53. Tokenizers• A Tokenizer is a TokenStream whose input is a Reader• Tokenizers break field text into tokens• StandardTokenizer – source string: “full-text lucene.apache.org” – “full” “text” “lucene.apache.org”• WhitespaceTokenizer – “full-text” “lucene.apache.org”• LetterTokenizer – “full” “text” “lucene” “apache” “org”
  54. 54. TokenFilters• A TokenFilter is a TokenStream whose input is another token stream• LowerCaseFilter• StopFilter• LengthFilter• PorterStemFilter – stemming: reducing words to root form – rides, ride, riding => ride – country, countries => countri
  55. 55. Analyzerpublic class MyAnalyzer2 extends Analyzer {private Set sw = StopFilter.makeStopSet(Version.LUCENE_36, new ArrayList(StopAnalyzer.ENGLISH_STOP_WORDS_SET)); @Override public TokenStream tokenStream(String string, Reader reader) { TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_36, reader); ts = new LowerCaseFilter(Version.LUCENE_36, ts); ts = new StopFilter(Version.LUCENE_36, ts, sw); return ts; }}
  56. 56. TermVectors• Store information about term frequency, positions and offsets• Relevance Feedback and “More Like This”• Clustering• Similarity between two documents• Highlighter – needs offsets info
  57. 57. Lucene TermVector• TermFreqVector is a representation of all of the terms and term counts in a specific Field of a Document instance• As a tuple: – termFreq = <term, term countD> – <fieldName, <…,termFreqi, termFreqi+1,…>>• As Java: – public String getField(); – public String[] getTerms(); – public int[] getTermFrequencies()
  58. 58. Lucene TermPositionVector• TermPositionVector is a representation of all of the terms, term counts, term position and term offsets in a specific Field of a Document instance• As Java: – public String getField(); – public String[] getTerms(); – public int[] getTermFrequencies() – public int[] getTermPositions(int index) – public TermVectorOffsetInfo[] getOffsets(int index)
  59. 59. Store TermVector• During indexing, create a Field that stores Term• Vectors: – new Field(“text",text,Field.Store.YES,Field.Index.A NALYZED,Field.TermVector.YES);• Options are: – Field.TermVector.YES – Field.TermVector.NO – Field.TermVector.WITH_POSITIONS – Field.TermVector.WITH_OFFSETS – Field.TermVector.WITH_POSITIONS_OFFSETS
  60. 60. Term Vector: Positions and Offsets Positions 0 1 2 3 4 5 6 7 8The quick brown fox jumped over the lazy dogs0<--->3 4<----->9 10<----->15 16<--->19 20 <------> 26 27<---->31 32<--->35 36<---->40 41<---->45 Offsets
  61. 61. Example• Main method with three parameters – Input_dir: directory to index – Index_dir: directory containing the index files – Flag (string) corresponding to different Field.TermVector... options • t: tokenize • tv: term vectors • tvp: term vectors with positions • tvo: term vectors with positions and offsets• The constructor of the Index class takes in input the three parameters and indexes the directory – Filters only “.txt” files
  62. 62. Lucene Search• IndexReader: interface for accessing an index• QueryParser: parses the user query• IndexSearcher: implements search over a single IndexReader – search(Query query, int numDoc) – TopDocs -> result of search • TopDocs.scoreDocs returns an array of retrieved documents
  63. 63. Lucene SearchDirectory dir = FSDirectory.open(new File(args[0]));IndexReader reader = IndexReader.open(dir);IndexSearcher searcher = new IndexSearcher(reader);Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);QueryParser parser = new QueryParser(Version.LUCENE_36, "text", analyzer);Query query = parser.parse(args[1]);TopDocs topDocs = searcher.search(query, 100);System.out.println("matches: " + topDocs.totalHits);for (int i = 0; i < topDocs.scoreDocs.length; i++) { Document doc = reader.document(topDocs.scoreDocs[i].doc); System.out.println("name: " + doc.get("name") + ", score: " + topDocs.scoreDocs[i].score); System.out.println();}searcher.close();
  64. 64. Lucene QueryParser• Example: – queryParser.parse(“name:SpiderMan”);• good human entered queries, debugging• does text analysis and constructs appropriate queries• not all query types supported
  65. 65. QUERY SYNTAX 1/2• title:"The Right Way" AND text:go – Phrase query + term query• Wildcard Searches – tes* (test - tests - tester) – te?t (test – text) – te*t (tempt) – tes?• Fuzzy Searches (Levenshtein Distance) (TERM) – roam~ (foam – roams) – roam~0.8• Range Searches – mod_date:[20020101 TO 20030101] – title:{Aida TO Carmen}• Proximity Searches (PHRASE) – "jakarta apache"~10
  66. 66. QUERY SYNTAX 2/2• Boosting a Term – jakarta^4 apache – "jakarta apache"^4 "Apache Lucene”• Boolean Operator – NOT, OR, AND – + required operator – - prohibit operator• Grouping by ( )• Field Grouping• title:(+return +"pink panther")• Escaping Special Characters by
  67. 67. QUERY (BY CODE) 1/2TermQuery query =new TermQuery(new Term(“name”,”Spider-Man”))• explicit, no escaping necessary• does not do text analysis for you• Query – TermQuery – BooleanQuery – PhraseQuery – FuzzyQuery / WildcardQuery /
  68. 68. QUERY (BY CODE) 2/2TermQuery tq1 = new TermQuery(new Term(“name”,”pippo”))TermQuery tq2 = new TermQuery(new Term(“name”,”pluto”))BooleanQuery bq = new BooleanQuery();bq.add(tq1, BooleanClause.Occur.MUST);bq.add(tq2, BooleanClause.Occur.SHOULD);
  69. 69. DELETING DOCUMENTSIndexWriter – deleteDocuments(Term t) – deleteDocuments(Term… t) – deleteDocuments(Query t) – deleteDocuments(Query… t) – updateDocument(Term t, Document d) – updateDocument(Term t, Document d, Analyzer a) – deleting does not immediately reclaim space
  70. 70. Relevance feedbackDocuments similarityApache TikaADVANCED TOPICS
  71. 71. Relevance feedback• Improve retrieval performance using information about document relevance – Explicit: relevance indicated by the user (assessor) – Implicit: from user behavior – Blind (or pseudo): using information about the top k retrieved documents
  72. 72. Relevance feedbackUser query New retrieved Searcher documents Retrieved Relevance feedback documents New query
  73. 73. Rocchio Algorithm  1  1 Qnew    Q      Dj      Dk Drel D j Drel Dnorel Dk Dnorel Original query Relevant document Non-relevant document
  74. 74. Rocchio Algorithm (blind)  1  1 Qnew    Q      Dj      Dk Drel D j Drel Dnorel Dk Dnorel In blind (pseudo) relevance feedback we know only relevant documents (supposed to be the top K documents) (generally adopted by search engine)
  75. 75. Relevance feedback Q = Alice Qnew = Alice Cheshire Cat King bookIt’s a friend of mine — a Cheshire It’s a friend of mine — a CheshireCat,’ said Alice: ‘allow me to Cat,’ said Alice: ‘allow me tointroduce it. introduce it.It’s the oldest rule in the book,’ It’s the oldest rule in the book,’said the King. ‘Then it ought to said the King. ‘Then it ought tobe Number One,’ said Alice be Number One,’ said AliceAlice looked round eagerly, and found Alice book was reprinted andthat it was the Red Queen. ‘She’s grown a published in 1866.good deal!’ was her first remark. …Alice lay back, and closed her eyes. Therewas the Red Queen again, with thatincessant grin. Or was it the Cheshire catsgrin?
  76. 76. Relevance feedback in Lucene• Possible using TermVector – Retrieve the top K documents – Build the Qnew using TermVector – Re-query using Qnew
  77. 77. Document similarity• Documents are represented by vectors – Build the document vector using TermVector – Compute the similarity using cosine similarity• Implement a functionality like “similar to this document”
  78. 78. Apache Tika• Extract metadata and text from several file formats – XML, HTML, OpenOffice, Microsoft Office, PDF – MBOX format (email) – Compressed files – EXIF metadata from image – Metadata from audio file (MP3)• Apache project as Lucene
  79. 79. Extract metadata and textTika tika = new Tika();String type = tika.detect(new File(args[0]));System.out.println("File type: " + type);Metadata metadata = new Metadata();String text = tika.parseToString(new FileInputStream(new File(args[0])), metadata);System.out.println("Metadata");String[] names = metadata.names();for (int i = 0; i < names.length; i++) { String value = metadata.get(names[i]); if (value != null) { System.out.println(names[i] + "=" + value); }}System.out.println("Text: " + text);
  80. 80. Apache Tika + LuceneFiles/URLs metadata Tika Lucene text
  81. 81. CONCLUSIONS
  82. 82. Conclusions• A popular IR model – Vector Space Model• Lucene – provides API to build search engine• Apace Tika – extracts metadata and text from files and URLs• LET’S GO TO BUILD YOUR SEARCH ENGINE!
  83. 83. Lucene Contrib• analyzers: new analyzer – Arabic, Chinese, …• highlighter: a set of classes for highlighting matching terms in search results• Snowball: stemmer analyzer based on Snowball library http://snowball.tartarus.org• spellchecker: tools for spellchecking and suggestions with Lucene
  84. 84. Lucene related project• Apache Solr: open source enterprise search platform from the Apache Lucene – Full-Text Search Capabilities, High Volume Web Traffic, HTML Administration Interfaces• Apache Nutch: web-search engine based on Solr and Lucene – crawler, link-graph database, HTML
  85. 85. Thank you for your attention! Questions?

×