Xebia Knowledge Exchange (March 2010) - Lucene: From theory to real world

Published in: Technology

Transcript

  • Lucene: from theory to real world
      Keywords: Information retrieval, Indexing, Cluster, Apache, Performance tuning, Parser, Dictionary, IndexReader, Solr, Real world, Java, Analysis, Troubleshooting, Vector, Relevance, Query, Server, Design, Fields, Document, Probabilistic, Production, Model, Search application, Open Source, Inverted index, Doug Cutting, Library, Architecture
  • Agenda
      • Introduction to Information Retrieval
      • Lucene overview
      • Lucene in detail
      • Search application design
      • Performance tuning
      www.xebia.fr / blog.xebia.fr
  • Information Retrieval
  • Information Retrieval
      "Information Retrieval (IR) is the science of searching for documents"
  • Inverted Index
  • Boolean Model
      • Queries and documents are conceived as sets of terms:
          Q = (T1 OR T2) AND (T3 OR T4)
          D1 = {T1, T3}
          D2 = {T2, T3, T4}
      • The result set of a query is a composition of unions and intersections, with union for the OR operator and intersection for the AND operator:
          R = {D1, D2}
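The Boolean model above can be tried out with plain Java collections. The following is a minimal sketch, not Lucene code: the class `BooleanModelSketch` and its methods are hypothetical names, and the inverted index is simply a map from each term to the set of documents containing it. It evaluates the slide's query Q against D1 and D2.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class BooleanModelSketch {
    // Inverted index: term -> sorted set of ids of the documents containing it.
    static Map<String, Set<String>> buildIndex(Map<String, Set<String>> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, Set<String>> doc : docs.entrySet()) {
            for (String term : doc.getValue()) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    // Documents containing the term (empty set for an unknown term).
    static Set<String> postings(Map<String, Set<String>> index, String term) {
        return index.getOrDefault(term, Collections.emptySet());
    }

    // OR is a set union.
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // AND is a set intersection.
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Q = (T1 OR T2) AND (T3 OR T4), as on the slide.
    static Set<String> evaluateSlideQuery(Map<String, Set<String>> index) {
        return and(or(postings(index, "T1"), postings(index, "T2")),
                   or(postings(index, "T3"), postings(index, "T4")));
    }

    // D1 = {T1, T3} and D2 = {T2, T3, T4}, as on the slide.
    static Map<String, Set<String>> slideDocuments() {
        Map<String, Set<String>> docs = new LinkedHashMap<>();
        docs.put("D1", new TreeSet<>(Arrays.asList("T1", "T3")));
        docs.put("D2", new TreeSet<>(Arrays.asList("T2", "T3", "T4")));
        return docs;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = buildIndex(slideDocuments());
        System.out.println("R = " + evaluateSlideQuery(index)); // prints: R = [D1, D2]
    }
}
```

Both documents match because each contains at least one term from each OR group.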
  • Vector Space Model
      • Documents and queries are represented as vectors of term weights:
          dj = (w1,j, w2,j, ..., wt,j)
          q = (w1,q, w2,q, ..., wt,q)
      • Similarity can be computed with: (formula not captured in the transcript)
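The similarity formula itself did not survive in the transcript. A common choice for the vector space model is the cosine of the angle between the document and query vectors, so the sketch below assumes that measure. The class name `CosineSimilaritySketch` and the sample weights are hypothetical.

```java
final class CosineSimilaritySketch {
    // cos(d, q) = (d . q) / (|d| * |q|); returns 0 when either vector is all zeros.
    static double cosine(double[] d, double[] q) {
        if (d.length != q.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0;
        double normD = 0;
        double normQ = 0;
        for (int i = 0; i < d.length; i++) {
            dot += d[i] * q[i];
            normD += d[i] * d[i];
            normQ += q[i] * q[i];
        }
        if (normD == 0 || normQ == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normD) * Math.sqrt(normQ));
    }

    public static void main(String[] args) {
        double[] query = {1, 1, 0}; // hypothetical weights for terms t1..t3
        double[] doc1 = {2, 2, 0};  // same direction as the query
        double[] doc2 = {0, 0, 3};  // shares no term with the query
        System.out.println("doc1: " + cosine(doc1, query));
        System.out.println("doc2: " + cosine(doc2, query));
    }
}
```

A document pointing in the same direction as the query scores close to 1, and a document sharing no term with the query scores 0, whatever the vector lengths.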
  • Lucene
  • Lucene: where do we come from?
  • Lucene documentation
  • Lucene: simple indexing example
      // Index a single document in an in-memory directory
      Directory directory = new RAMDirectory();
      IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
      Document doc = new Document();
      // "company" is stored and indexed as a single token, so it can be searched
      doc.add(new Field("company", "Xebia", Field.Store.YES, Field.Index.NOT_ANALYZED));
      // "country" is stored only (Field.Index.NO): retrievable, but not searchable
      doc.add(new Field("country", "France", Field.Store.YES, Field.Index.NO));
      writer.addDocument(doc);
      writer.close();
  • Lucene: simple search example
      // Open the index read-only; "directory" is the Directory used in the indexing example
      IndexSearcher searcher = new IndexSearcher(directory, true);
      // Search an indexed field: "country" was added with Field.Index.NO, so it could never match
      Term t = new Term("company", "Xebia");
      Query query = new TermQuery(t);
      TopDocs docs = searcher.search(query, 10);
      assertEquals(1, docs.totalHits);
      searcher.close();
  • Lucene - indexing
  • Lucene - analyzers
  • Lucene - Field types
      • Store: YES / NO
      • Index: NO / ANALYZED / NOT_ANALYZED / ANALYZED_NO_NORMS / NOT_ANALYZED_NO_NORMS
      • TermVector: NO / WITH_POSITIONS / WITH_OFFSETS / WITH_POSITIONS_OFFSETS / YES
  • Lucene storage - segments
  • Lucene storage - segments
      • A new segment is created each time the IndexWriter is flushed
      • When documents are deleted, a marker is added in the current segment
  • Lucene storage - segment merging
      • Segments are merged manually with IndexWriter.optimize()
      • Or merged automatically, depending on the level of each segment:
          (int) (log(max(minMergeMB, size)) / log(mergeFactor))
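To get a feel for the level formula above, it can be transcribed directly into Java. This is only a sketch of the slide's expression, not Lucene's merge policy code, which contains more logic. The class name `MergeLevelSketch` is hypothetical, sizes are taken in MB, and the values 1.6 MB and 10 used for `minMergeMB` and `mergeFactor` are illustrative assumptions.

```java
final class MergeLevelSketch {
    // level = (int) (log(max(minMergeMB, size)) / log(mergeFactor)), sizes in MB.
    static int level(double sizeMB, double minMergeMB, int mergeFactor) {
        return (int) (Math.log(Math.max(minMergeMB, sizeMB)) / Math.log(mergeFactor));
    }

    public static void main(String[] args) {
        double minMergeMB = 1.6; // assumed floor: smaller segments are treated as this size
        int mergeFactor = 10;    // assumed merge factor
        for (double sizeMB : new double[] {0.5, 5, 50, 500}) {
            System.out.println(sizeMB + " MB -> level " + level(sizeMB, minMergeMB, mergeFactor));
        }
    }
}
```

With these values, segments of 0.5 MB and 5 MB fall in level 0, a 50 MB segment in level 1, and a 500 MB segment in level 2: each level covers sizes roughly `mergeFactor` times larger than the previous one.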
  • Lucene - search
  • Lucene - search
      • Programmatic API:
          • TermQuery
          • PhraseQuery
          • WildcardQuery
          • RangeQuery
          • FuzzyQuery
          • BooleanQuery
  • Lucene - QueryParser
      • QueryParser builds a Query object from a user query string:
          +JUNIT +ANT -MOCK
          +xebya~0.8
          +title:"Junit in action"
      • Most of the time, it won't fit application requirements
  • Lucene - contrib/QueryParser
      • Framework that simplifies the creation of a query parser that fits your needs
      • 3 layers:
          • QueryParser: transforms a query string into an abstract syntax tree representation
          • QueryNodeProcessor: processes nodes of the tree to move, remove or modify them
          • QueryBuilder: builds a Lucene BooleanQuery tree from the abstract syntax tree
  • Lucene - boolean queries
  • Lucene - PhraseQuery & SpanQuery
      • SpanQuery: matches documents that contain terms separated by at most n other terms (n is the "slop")
      • PhraseQuery: SpanQuery with a slop value of 0
      • Uses position information
  • Lucene storage - approximate queries
      • Approximate queries (Prefix, Regex, Wildcard, Fuzzy) get transformed to a set of TermQueries:
          Dictionary = { court, cours, courir }
          FuzzyQuery = cour
          TransformedQuery = court OR cours
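The expansion of a fuzzy query into plain term queries can be sketched without Lucene. The example below assumes Levenshtein edit distance as the closeness measure and a hypothetical `maxEdits` cut-off (Lucene's FuzzyQuery uses a similarity threshold derived from the edit distance instead). The class `FuzzyExpansionSketch` and its methods are illustrative names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

final class FuzzyExpansionSketch {
    // Dynamic-programming Levenshtein distance: the minimum number of
    // insertions, deletions and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Keeps the dictionary terms within maxEdits of the query term, in dictionary order.
    static List<String> expand(String term, Collection<String> dictionary, int maxEdits) {
        List<String> matches = new ArrayList<>();
        for (String candidate : dictionary) {
            if (levenshtein(term, candidate) <= maxEdits) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("court", "cours", "courir");
        System.out.println(String.join(" OR ", expand("cour", dictionary, 1))); // prints: court OR cours
    }
}
```

"court" and "cours" are one insertion away from "cour", while "courir" needs two, which is why only the first two terms end up in the transformed query. Scanning the whole dictionary like this is what makes the approach costly on large indexes.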
  • Inverted Index
  • Lucene - Levenshtein distance
      • FuzzyQuery uses the Levenshtein distance: the number of modifications required to turn one word into another
  • Lucene - FuzzyQuery
      • Current implementation is not optimal
      • LUCENE-2089 will use a Levenshtein automaton
  • Lucene - Highlighter
      • Produces ready-to-use HTML snippets with the query words highlighted
      • Can be fully customized
      • By default, limited to 50 KB of characters
      • Use FastVectorHighlighter for faster results (~2.5 times faster)
  • Lucene - FieldCache
      • Lucene cache that keeps the values of a single field in memory
      • Used internally by Sort objects
      • Can be used to manually load the values of a single field:
          float[] weights = FieldCache.DEFAULT.getFloats(reader, "weight");
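The idea behind a field cache, one array of values per field, indexed by document id and loaded once, can be illustrated in a few lines. This is a conceptual sketch only and does not reproduce Lucene's FieldCache: `FieldCacheSketch`, `FieldLoader` and `sortByValueDesc` are hypothetical names, and the loader stands in for reading values from an index.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class FieldCacheSketch {
    // Hypothetical stand-in for reading one value per document from the index.
    interface FieldLoader {
        float[] load(String field);
    }

    private final Map<String, float[]> cache = new HashMap<>();
    private int loads = 0;

    // Returns the per-document values of a field, loading them only on first access.
    float[] getFloats(String field, FieldLoader loader) {
        float[] values = cache.get(field);
        if (values == null) {
            values = loader.load(field);
            cache.put(field, values);
            loads++;
        }
        return values;
    }

    // Number of times a field actually had to be loaded.
    int loadCount() {
        return loads;
    }

    // Sorts document ids by descending field value using the cached array.
    static Integer[] sortByValueDesc(float[] values) {
        Integer[] docIds = new Integer[values.length];
        for (int i = 0; i < docIds.length; i++) {
            docIds[i] = i;
        }
        Arrays.sort(docIds, (a, b) -> Float.compare(values[b], values[a]));
        return docIds;
    }

    public static void main(String[] args) {
        FieldCacheSketch fieldCache = new FieldCacheSketch();
        FieldLoader loader = field -> new float[] {0.5f, 2.0f, 1.0f}; // hypothetical "weight" values
        float[] weights = fieldCache.getFloats("weight", loader);
        fieldCache.getFloats("weight", loader); // second call is served from memory
        System.out.println("order: " + Arrays.toString(sortByValueDesc(weights)));
        System.out.println("loads: " + fieldCache.loadCount());
    }
}
```

The first access pays the loading cost; later sorts on the same field only read the in-memory array, which is why warming the cache before serving queries helps.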
  • Lucene - MoreLikeThis
      • Finds similar documents
      • Produces a query to be searched:
          MoreLikeThis mlt = new MoreLikeThis(reader);
          mlt.setFieldNames(new String[] {"title", "author"});
          // Keep terms that occur at least once in the source document and in at least one indexed document
          mlt.setMinTermFreq(1);
          mlt.setMinDocFreq(1);
          Query query = mlt.like(docId);
          indexSearcher.search(query, 10);
  • Lucene - FunctionQueries
      • Allow score customization
      • Consider using FieldCache to reduce fetching cost
          FieldScoreQuery scoreQuery = new FieldScoreQuery("score", FieldScoreQuery.Type.BYTE);
          // q is the main query; scoreQuery supplies a per-document value read from the "score" field
          CustomScoreQuery customQ = new CustomScoreQuery(q, scoreQuery) {
              public float customScore(int doc, float subQueryScore, float valSrcScore) {
                  return (float) (Math.sqrt(subQueryScore) * valSrcScore);
              }
          };
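The effect of the custom score above on ranking can be checked in isolation. The sketch below applies the same combination, the square root of the query score multiplied by the field value, to hypothetical per-document arrays. `CustomScoreSketch` and `rank` are illustrative names and the input numbers are made up.

```java
import java.util.ArrayList;
import java.util.List;

final class CustomScoreSketch {
    // Same combination as on the slide: sqrt(query score) * field value.
    static float customScore(float subQueryScore, float valSrcScore) {
        return (float) (Math.sqrt(subQueryScore) * valSrcScore);
    }

    // Returns document ids ordered by descending custom score.
    static List<Integer> rank(float[] queryScores, float[] fieldValues) {
        if (queryScores.length != fieldValues.length) {
            throw new IllegalArgumentException("one query score and one field value per document");
        }
        List<Integer> docIds = new ArrayList<>();
        for (int i = 0; i < queryScores.length; i++) {
            docIds.add(i);
        }
        docIds.sort((a, b) -> Float.compare(
                customScore(queryScores[b], fieldValues[b]),
                customScore(queryScores[a], fieldValues[a])));
        return docIds;
    }

    public static void main(String[] args) {
        float[] queryScores = {4f, 1f, 9f}; // hypothetical relevance scores for docs 0..2
        float[] fieldValues = {1f, 5f, 1f}; // hypothetical values of the "score" field
        System.out.println(rank(queryScores, fieldValues)); // prints: [1, 2, 0]
    }
}
```

Document 1 has the lowest query score but the highest field value, so it moves to the top: the square root dampens the text relevance while the field value is applied in full.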
  • Lucene - Luke
  • Lucene - Global performance tuning
      • Consider using SSDs for low latency
      • Consider using RAMDirectory / InstantiatedIndex
      • Use the latest version of Lucene
      • Use NIOFSDirectory on Unix and MMapDirectory on Windows
      • Try turning off setUseCompoundFile
  • Lucene - Indexing performance tuning
      • Set RAMBufferSizeMB according to your needs
      • Tune your merge policy with care
  • Lucene - Search performance tuning
      • Open the IndexReader in read-only mode (default in Lucene 2.9+)
      • Warm up the FieldCache to ensure immediate access when sorting
      • Limit the use of TermVector
      • Ensure the index is optimized
  • Architecture with Hibernate Search
  • Architecture with Solr
  • Architecture with Infinispan
  • Lucene - Distributed: Katta
      • Shards and distributes a Lucene index over instances
      • Uses Hadoop for distribution
  • Lucene galaxy
      • Apache Nutch: Lucene + crawling and parsing
      • Compass: search engine framework
      • Apache Solr: Lucene standalone search server
      • Apache Mahout: distributed machine learning
      • Hibernate Search: Hibernate + Lucene
      • Katta: distributed Lucene with Hadoop
  • Lucene - Futures
      • Flex branch: making Lucene even more customizable
      • Apache Mahout: distributed machine learning for clustering, classification and recommendation algorithms
  • Questions?
