Xebia Knowledge Exchange (mars 2010) - Lucene : From theory to real world

Transcript of "Xebia Knowledge Exchange (mars 2010) - Lucene : From theory to real world"

  1. Lucene: from theory to real world
     Keywords: Information retrieval, Indexing, Cluster, Apache, Performance tuning, Parser, Dictionary, IndexReader, Solr, Real world, Java, Analysis, Troubleshooting, Vector, Relevance, Query, Server, Design, Fields, Document, Probabilistic, Production, Model, Search application, Open Source, Inverted index, Doug Cutting, Library, Architecture
  2. Agenda
     - Introduction to Information Retrieval
     - Lucene overview
     - Lucene in detail
     - Search application design
     - Performance tuning
     www.xebia.fr / blog.xebia.fr
  3. Information Retrieval
  4. Information Retrieval
     "Information Retrieval (IR) is the science of searching for documents"
  6. Inverted Index
  7. Boolean Model
     - Queries and documents are modeled as sets of terms
         Q = (T1 OR T2) AND (T3 OR T4)
         D1 = {T1, T3}
         D2 = {T2, T3, T4}
     - The result set of a query is a composition of unions and intersections
         R = {D1, D2}
         with union for the OR operator and intersection for the AND operator
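The Boolean model on slide 7 can be sketched with plain JDK collections. This is a minimal illustration, not Lucene code: the class and method names (`BooleanModelSketch`, `invert`, `postings`) are made up for the example, and the inverted index is assumed to be a simple map from each term to the set of documents that contain it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: evaluates the slide's query with set unions and intersections.
final class BooleanModelSketch {

    // Builds an inverted index: term -> ids of the documents containing that term.
    static Map<String, Set<String>> invert(Map<String, Set<String>> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : docs.entrySet()) {
            for (String term : entry.getValue()) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(entry.getKey());
            }
        }
        return index;
    }

    // Documents containing the term (empty set if the term is unknown).
    static Set<String> postings(Map<String, Set<String>> index, String term) {
        return index.getOrDefault(term, Collections.<String>emptySet());
    }

    // OR is a set union.
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // AND is a set intersection.
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Q = (T1 OR T2) AND (T3 OR T4) over D1 = {T1, T3} and D2 = {T2, T3, T4}.
    static Set<String> slideQuery() {
        Map<String, Set<String>> docs = new TreeMap<>();
        docs.put("D1", new TreeSet<>(Arrays.asList("T1", "T3")));
        docs.put("D2", new TreeSet<>(Arrays.asList("T2", "T3", "T4")));
        Map<String, Set<String>> index = invert(docs);
        return and(or(postings(index, "T1"), postings(index, "T2")),
                   or(postings(index, "T3"), postings(index, "T4")));
    }

    public static void main(String[] args) {
        System.out.println(slideQuery()); // prints [D1, D2]
    }
}
```

Both documents match because each contains at least one term from each OR group, which is the R = {D1, D2} result shown on the slide.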
  8. Vector Space Model
     - Documents and queries are represented as vectors
         d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j})
         q = (w_{1,q}, w_{2,q}, ..., w_{t,q})
     - Similarity can be computed with: (formula image not included in the transcript)
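The similarity formula itself did not survive in the transcript. A common choice in the vector space model is cosine similarity, the dot product of the two weight vectors divided by the product of their lengths. The sketch below assumes that choice; the class name and the sample weights are illustrative only.

```java
// Hypothetical sketch: cosine similarity between a document vector d_j and a query vector q.
// Assumes cosine similarity is the intended measure; the slide's own formula is not in the transcript.
final class CosineSimilaritySketch {

    static double cosine(double[] d, double[] q) {
        if (d.length != q.length) {
            throw new IllegalArgumentException("vectors must have the same number of terms");
        }
        double dot = 0.0;
        double normD = 0.0;
        double normQ = 0.0;
        for (int i = 0; i < d.length; i++) {
            dot += d[i] * q[i];
            normD += d[i] * d[i];
            normQ += q[i] * q[i];
        }
        if (normD == 0.0 || normQ == 0.0) {
            return 0.0; // an all-zero vector shares no weighted term with anything
        }
        return dot / (Math.sqrt(normD) * Math.sqrt(normQ));
    }

    public static void main(String[] args) {
        double[] doc = {1.0, 0.0, 2.0};   // illustrative term weights
        double[] query = {1.0, 0.0, 1.0};
        System.out.println(cosine(doc, query));
    }
}
```

A value close to 1 means the document and the query weight the same terms in similar proportions; 0 means they share no weighted term.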
  9. Lucene
  10. Lucene: where do we come from?
  11. Lucene documentation
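  12. Lucene: simple indexing example
        import org.apache.lucene.analysis.WhitespaceAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.RAMDirectory;
        Directory directory = new RAMDirectory();  // in-memory index
        IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("company", "Xebia", Field.Store.YES, Field.Index.NOT_ANALYZED));  // stored, indexed as one term
        doc.add(new Field("country", "France", Field.Store.YES, Field.Index.NO));  // stored only, not searchable
        writer.addDocument(doc);
        writer.close();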
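  13. Lucene: simple search example
        import org.apache.lucene.index.Term;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.TermQuery;
        import org.apache.lucene.search.TopDocs;
        IndexSearcher searcher = new IndexSearcher(directory, true);  // true = read-only
        // Query the indexed "company" field: "country" was added with Field.Index.NO, so it cannot match
        Term t = new Term("company", "Xebia");
        Query query = new TermQuery(t);
        TopDocs docs = searcher.search(query, 10);  // at most 10 hits
        assertEquals(1, docs.totalHits);  // JUnit assertion
        searcher.close();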
  14. Lucene - indexing
  15. Lucene - analyzers
  16. Lucene - Field types
      - Store: YES / NO
      - Index: NO / ANALYZED / NOT_ANALYZED / ANALYZED_NO_NORMS / NOT_ANALYZED_NO_NORMS
      - TermVector: NO / WITH_POSITIONS / WITH_OFFSETS / WITH_POSITIONS_OFFSETS / YES
  17. Lucene storage - segments
  18. Lucene storage - segments
      - A new segment is created each time the IndexWriter is flushed
      - When documents are deleted, a marker is added in the current segment
  19. Lucene storage - segments merge
      - Segments are merged manually with IndexWriter.optimize()
      - Or merged automatically, depending on:
          (int) log(max(minMergeMB, size))/log(mergeFactor)
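The level formula on slide 19 can be tried out in isolation. The sketch below is a JDK-only illustration rather than Lucene's merge policy code: it assumes the cast applies to the whole quotient, that sizes are expressed in MB, and that segments falling on the same level are the ones considered for merging together. The sample values for `minMergeMB` and `mergeFactor` are illustrative.

```java
// Hypothetical sketch of the slide's formula:
//   level = (int) (log(max(minMergeMB, size)) / log(mergeFactor))
final class MergeLevelSketch {

    static int level(double sizeMB, double minMergeMB, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        // Segments smaller than minMergeMB are all treated as if they had that size.
        double effectiveSize = Math.max(minMergeMB, sizeMB);
        return (int) (Math.log(effectiveSize) / Math.log(mergeFactor));
    }

    public static void main(String[] args) {
        double minMergeMB = 1.6; // illustrative value
        int mergeFactor = 10;    // illustrative value
        for (double sizeMB : new double[] {0.5, 50.0, 500.0}) {
            System.out.println(sizeMB + " MB -> level " + level(sizeMB, minMergeMB, mergeFactor));
        }
    }
}
```

With these values, segments of 0.5 MB, 50 MB and 500 MB land on levels 0, 1 and 2. A larger `mergeFactor` makes each level cover a wider range of sizes, so merges happen less often but involve more segments at once.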
  20. Lucene - search
  21. Lucene - search
      - Programmatic API
        - TermQuery
        - PhraseQuery
        - WildcardQuery
        - RangeQuery
        - FuzzyQuery
        - BooleanQuery
  22. Lucene - QueryParser
      - QueryParser builds a Query object from a user query string
          +JUNIT +ANT -MOCK
          +xebya~0.8
          +title:"Junit in action"
      - Most of the time, it won't fit application requirements
  23. Lucene - contrib/QueryParser
      - Framework that simplifies the creation of a query parser that fits your needs
      - 3 layers:
        - QueryParser: transforms a query string into an abstract syntax tree representation
        - QueryNodeProcessor: processes nodes of the tree to move, remove or modify them
        - QueryBuilder: builds a Lucene BooleanQuery tree from the abstract syntax tree
  24. Lucene - boolean queries
  25. Lucene - PhraseQuery & SpanQuery
      - SpanQuery: matches documents that contain terms separated by up to n other terms (n is the 'slop')
      - PhraseQuery: SpanQuery with a slop value of 0
      - Uses position information
  26. Lucene storage - approximate queries
      - Approximate queries (Prefix, Regex, Wildcard, Fuzzy) get transformed into a set of TermQueries
          Dictionary = { court, cours, courir }
          FuzzyQuery = cour
          TransformedQuery = court OR cours
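The expansion on slide 26 can be reproduced with a small JDK-only sketch. It scans the term dictionary and keeps the terms within a given edit (Levenshtein) distance of the query term. The threshold of one edit is an assumption, since the slide does not state one, and the class and method names are made up for the example; this is not how Lucene implements FuzzyQuery internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch: expands a fuzzy term into the dictionary terms that are close enough.
final class FuzzyExpansionSketch {

    // Classic dynamic-programming Levenshtein distance, keeping only two rows in memory.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a: j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b: i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    // Keeps the dictionary terms within maxEdits of the query term, in dictionary order.
    static List<String> expand(String queryTerm, Collection<String> dictionary, int maxEdits) {
        List<String> matches = new ArrayList<>();
        for (String term : dictionary) {
            if (levenshtein(queryTerm, term) <= maxEdits) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("court", "cours", "courir");
        // Each kept term would become one TermQuery, combined with OR.
        System.out.println(String.join(" OR ", expand("cour", dictionary, 1))); // prints court OR cours
    }
}
```

"court" and "cours" are one insertion away from "cour", while "courir" needs two, so it is left out. The cost of this approach is that every term in the dictionary has to be compared with the query term.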
  27. Inverted Index
  28. Lucene - Levenshtein distance
      - FuzzyQuery uses Levenshtein distance:
        - the number of modifications required to turn one word into another
  29. Lucene - FuzzyQuery
      - Current implementation not optimal
      - LUCENE-2089 will use a Levenshtein automaton
  30. Lucene - Highlighter
      - Produces ready-to-use HTML snippets with the query words highlighted
      - Can be fully customized
      - By default limited to 50K characters
      - Use FastVectorHighlighter for faster results (~2.5 times faster)
  31. Lucene - FieldCache
      - Lucene cache that stores the values of a single field in memory
      - Used internally by Sort objects
      - Can be used to manually load the values of a single field:
          float[] weights = FieldCache.DEFAULT.getFloats(reader, "weight");
  32. Lucene - MoreLikeThis
      - Finds similar documents
      - Produces a query to be searched
          // MoreLikeThis lives in contrib/queries: org.apache.lucene.search.similar.MoreLikeThis
          MoreLikeThis mlt = new MoreLikeThis(reader);
          mlt.setFieldNames(new String[] {"title", "author"});  // fields to take terms from
          mlt.setMinTermFreq(1);  // keep terms that occur at least once in the source document
          mlt.setMinDocFreq(1);   // keep terms that occur in at least one document
          Query query = mlt.like(docId);  // docId: Lucene document number of the source document
          indexSearcher.search(query, 10);
  33. Lucene - FunctionQueries
      - Allows score customization
      - Consider using FieldCache to reduce fetching cost
          // The byte value of the "score" field is used as the value-source score
          FieldScoreQuery scoreQuery = new FieldScoreQuery("score", FieldScoreQuery.Type.BYTE);
          // q is the main query whose score gets customized
          CustomScoreQuery customQ = new CustomScoreQuery(q, scoreQuery) {
              @Override
              public float customScore(int doc, float subQueryScore, float valSrcScore) {
                  return (float) (Math.sqrt(subQueryScore) * valSrcScore);
              }
          };
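To see what the combining function on slide 33 does to a score, it can be evaluated outside Lucene. The sketch below only copies the arithmetic from the slide into a plain static method; the class name and the sample scores are illustrative.

```java
// Hypothetical sketch: the slide's combining function, isolated from Lucene.
final class CustomScoreSketch {

    // Square root of the text-relevance score, multiplied by the field-based score.
    static float customScore(float subQueryScore, float valSrcScore) {
        return (float) (Math.sqrt(subQueryScore) * valSrcScore);
    }

    public static void main(String[] args) {
        System.out.println(customScore(4.0f, 3.0f)); // sqrt(4) * 3
        System.out.println(customScore(9.0f, 0.5f)); // sqrt(9) * 0.5
    }
}
```

Taking the square root flattens differences in the sub-query score, so the field value weighs more heavily in the final ranking than it would with a plain product.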
  34. Lucene - Luke
  35. Lucene - Global performance tuning
      - Consider using SSDs for low latency
      - Consider using RAMDirectory / InstantiatedIndex
      - Use the latest version of Lucene
      - Use NIOFSDirectory on Unix and MMapDirectory on Windows
      - Try turning off setUseCompoundFile
  36. Lucene - Indexing performance tuning
      - Set RAMBufferSizeMB according to your needs
      - Tune your merge policy with care
  37. Lucene - Search performance tuning
      - Open the IndexReader in read-only mode (default in Lucene 2.9+)
      - Warm up the FieldCache to ensure immediate access when sorting
      - Limit use of TermVector
      - Ensure the index is optimized
  38. Architecture with Hibernate Search
  39. Architecture with Solr
  40. Architecture with Infinispan
  41. Lucene - Distributed: Katta
      - Shards and distributes a Lucene index over instances
      - Uses Hadoop for distribution
  42. Lucene galaxy
      - Apache Nutch: Lucene + crawling and parsing
      - Compass: search engine framework
      - Apache Solr: Lucene standalone search server
      - Apache Mahout: distributed machine learning
      - Hibernate Search: Hibernate + Lucene
      - Katta: distributed Lucene with Hadoop
  43. Lucene - Futures
      - Flex branch: making Lucene even more customizable
      - Apache Mahout: distributed machine learning for clustering, classification and recommendation algorithms
  44. Questions?