Xebia Knowledge Exchange (March 2010) - Lucene: From theory to real world

Transcript

  • 1. Lucene: from theory to real world
    Information retrieval
    Indexing
    Cluster
    Apache
    Performance tuning
    Parser
    Dictionary
    IndexReader
    Solr
    Real world
    Java
    Analysis
    Troubleshooting
    Vector
    Relevance
    Query
    Server
    Design
    Fields
    Document
    Probabilistic
    Production
    Model
    Search application
    Open Source
    Inverted index
    Doug Cutting
    Library
    Architecture
  • 2. Agenda
    Introduction to Information Retrieval
    Lucene overview
    Lucene in detail
    Search applications design
    Performance tuning
    www.xebia.fr / blog.xebia.fr
  • 3. Information Retrieval
  • 4. Information Retrieval
    "Information Retrieval (IR) is the science of searching for documents"
  • 5. 5
  • 6. Inverted Index
  • 7. Boolean Model
    • Queries and documents are treated as sets of terms
    Q = (T1 OR T2) AND (T3 OR T4)
    D1 = {T1, T3}
    D2 = {T2, T3, T4}
    • The result set of a query is a composition of unions and intersections
    R = {D1, D2}
    with union for the OR operator
    and intersection for the AND operator
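    The evaluation above can be sketched in plain Java. This is only an illustration of the set operations, not Lucene code: the class and method names are hypothetical, and the postings are held in simple in-memory sets.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch of the Boolean model: each term maps to the set of documents
// containing it, OR is a set union and AND is a set intersection.
final class BooleanModelSketch {

    // Builds a term -> documents map (an inverted index) from document -> terms.
    static Map<String, Set<String>> invert(Map<String, List<String>> docs) {
        Map<String, Set<String>> index = new HashMap<>();
        for (Map.Entry<String, List<String>> e : docs.entrySet()) {
            for (String term : e.getValue()) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(e.getKey());
            }
        }
        return index;
    }

    // Documents containing the term, or an empty set for an unknown term.
    static Set<String> postings(Map<String, Set<String>> index, String term) {
        return index.getOrDefault(term, new TreeSet<>());
    }

    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Evaluates Q = (T1 OR T2) AND (T3 OR T4) on D1 and D2 from the slide.
    static Set<String> slideExample() {
        Map<String, List<String>> docs = new HashMap<>();
        docs.put("D1", Arrays.asList("T1", "T3"));
        docs.put("D2", Arrays.asList("T2", "T3", "T4"));
        Map<String, Set<String>> index = invert(docs);
        Set<String> left = or(postings(index, "T1"), postings(index, "T2"));
        Set<String> right = or(postings(index, "T3"), postings(index, "T4"));
        return and(left, right);
    }

    public static void main(String[] args) {
        System.out.println(slideExample()); // prints [D1, D2]
    }
}
```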
  • 8. Vector Space Model
    • Documents and queries are represented as vectors
    dj = (w1,j, w2,j, ..., wt,j)
    q = (w1,q, w2,q, ..., wt,q)
    • Similarity between dj and q can then be computed from these vectors
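    The similarity formula itself did not survive in the transcript. A common choice in the vector space model is the cosine of the angle between the two vectors, so the sketch below assumes that measure. The class name and the sample weights are hypothetical.

```java
// Sketch: cosine similarity between a document vector dj and a query vector q.
// Assumption: cosine is the measure meant by the slide; the transcript does not say.
final class CosineSimilaritySketch {

    // Both arrays hold one weight per vocabulary term, in the same term order.
    static double cosine(double[] d, double[] q) {
        if (d.length != q.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double normD = 0.0;
        double normQ = 0.0;
        for (int i = 0; i < d.length; i++) {
            dot += d[i] * q[i];
            normD += d[i] * d[i];
            normQ += q[i] * q[i];
        }
        if (normD == 0.0 || normQ == 0.0) {
            return 0.0; // a vector with no weights is treated as similar to nothing
        }
        return dot / (Math.sqrt(normD) * Math.sqrt(normQ));
    }

    public static void main(String[] args) {
        double[] dj = {1.0, 0.0, 2.0}; // hypothetical weights w1,j .. w3,j
        double[] q = {1.0, 0.0, 0.0};  // hypothetical weights w1,q .. w3,q
        System.out.println(cosine(dj, q));
    }
}
```

    A value close to 1 means the document and the query weight the same terms in similar proportions; 0 means they share no weighted term.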
  • 9. Lucene
  • 10. Lucene : where do we come from ?
  • 11. Lucene documentation
  • 12. Lucene : Simple indexing example
    // In-memory index, for the example only
    Directory directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    // "company" is stored and indexed as a single, non-analyzed term
    doc.add(new Field("company", "Xebia", Field.Store.YES, Field.Index.NOT_ANALYZED));
    // "country" is stored only (Field.Index.NO)
    doc.add(new Field("country", "France", Field.Store.YES, Field.Index.NO));
    writer.addDocument(doc);
    writer.close();
  • 13. Lucene : Simple search example
    // Read-only searcher on the directory used for indexing
    IndexSearcher searcher = new IndexSearcher(directory, true);
    // Search the "company" field: "country" is declared with Field.Index.NO, so it is not searchable
    Term t = new Term("company", "Xebia");
    Query query = new TermQuery(t);
    TopDocs docs = searcher.search(query, 10);
    assertEquals(1, docs.totalHits);
    searcher.close();
  • 14. Lucene - indexing
  • 15. Lucene - analyzers
  • 16. Lucene – Field types
    • Store : YES / NO
    • Index : NO / ANALYZED / NOT_ANALYZED / ANALYZED_NO_NORMS / NOT_ANALYZED_NO_NORMS
    • TermVector : NO / WITH_POSITIONS / WITH_OFFSETS / WITH_POSITIONS_OFFSETS / YES
  • 17. Lucene storage - segments
  • 18. Lucene storage - segments
    • A new segment is created each time the IndexWriter is flushed
    • When documents are deleted, they are only flagged with a deletion marker in their segment; the space is reclaimed when segments are merged
  • 19. Lucene storage – segments merge
    • Segments are merged manually with IndexWriter.optimize()
    • Or merged automatically by the merge policy, which groups segments into levels :
    level = (int) (log(max(minMergeMB, size)) / log(mergeFactor))
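    The level formula can be tried out with a few segment sizes. The sketch below only evaluates the expression from the slide. It is not the merge policy's actual code, and the values 1.6 MB and 10 used for minMergeMB and mergeFactor are example settings chosen for the illustration.

```java
// Sketch: evaluates the level formula from the slide for a few segment sizes.
// Not Lucene code; the parameter values used in main are illustrative assumptions.
final class MergeLevelSketch {

    // level = (int) (log(max(minMergeMB, size)) / log(mergeFactor))
    static int level(double sizeMB, double minMergeMB, int mergeFactor) {
        double effectiveSize = Math.max(minMergeMB, sizeMB); // tiny segments share the lowest level
        return (int) (Math.log(effectiveSize) / Math.log(mergeFactor));
    }

    public static void main(String[] args) {
        double minMergeMB = 1.6; // example value
        int mergeFactor = 10;    // example value
        double[] sizesMB = {0.5, 5.0, 50.0, 500.0};
        for (double size : sizesMB) {
            System.out.println(size + " MB -> level " + level(size, minMergeMB, mergeFactor));
        }
    }
}
```

    With these settings, segments whose sizes differ by less than a factor of mergeFactor tend to land on the same level, which is what makes them candidates for being merged together.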
  • 20. Lucene - search
  • 21. Lucene - search
  • 22. Lucene - QueryParser
    • QueryParser builds a Query object from a user query string
    +JUNIT +ANT -MOCK
    +xebya~0.8
    +title:"Junit in action"
    • Most of the time, it won't fit application requirements
  • 23. Lucene – contrib/QueryParser
    • Framework that simplifies the creation of a query parser that fits your needs
    • 3 layers :
    • QueryParser : transforms a query string into an Abstract Syntax Tree representation
    • QueryNodeProcessor : processes nodes of the tree to move, remove or modify them
    • QueryBuilder : builds a Lucene BooleanQuery tree from the abstract syntax tree
  • 24. Lucene – boolean queries
  • 25. Lucene – PhraseQuery & SpanQuery
    • SpanQuery : matches documents that contain terms separated by at most n other terms (n is the 'slop')
    • PhraseQuery : SpanQuery with a slop value of 0
    • Uses position information
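    To illustrate how position information and slop interact, here is a simplified sketch for two terms in a fixed order. It is not Lucene's matching code, which handles more cases (more than two terms, optional ordering). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: ordered two-term proximity matching driven by term positions.
// A slop of 0 means the terms must be adjacent, as in an exact phrase.
final class SlopMatchSketch {

    // Positions at which a term occurs in the token stream.
    static List<Integer> positions(String[] tokens, String term) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                result.add(i);
            }
        }
        return result;
    }

    // True if 'first' is followed by 'second' with at most 'slop' other terms in between.
    static boolean matches(String[] tokens, String first, String second, int slop) {
        for (int p1 : positions(tokens, first)) {
            for (int p2 : positions(tokens, second)) {
                int between = p2 - p1 - 1; // number of terms separating the two
                if (between >= 0 && between <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] tokens = {"junit", "in", "action"};
        System.out.println(matches(tokens, "junit", "action", 0)); // prints false
        System.out.println(matches(tokens, "junit", "action", 1)); // prints true
    }
}
```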
  • 26. Lucene storage – approximate queries
    • Approximate queries (Prefix, Regex, Wildcard, Fuzzy) get transformed into a set of TermQueries
    Dictionary = { court, cours, courir }
    FuzzyQuery = cour
    TransformedQuery = court OR cours
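    The expansion above can be reproduced with a brute-force scan of the dictionary. The sketch below is an assumption-laden illustration, not Lucene's implementation: it keeps every dictionary term within a maximum edit distance of the query term (the maxDistance parameter is introduced here for the example), and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: expands a fuzzy term into the dictionary terms that are close to it,
// then joins them with OR, mirroring the example on the slide.
final class FuzzyExpansionSketch {

    // Classic dynamic-programming edit distance (insertions, deletions, substitutions).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Keeps the dictionary terms within maxDistance edits of the query term.
    static List<String> expand(String queryTerm, List<String> dictionary, int maxDistance) {
        List<String> matches = new ArrayList<>();
        for (String term : dictionary) {
            if (levenshtein(queryTerm, term) <= maxDistance) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("court", "cours", "courir");
        List<String> terms = expand("cour", dictionary, 1);
        System.out.println(String.join(" OR ", terms)); // prints court OR cours
    }
}
```

    Scanning every term like this is costly on a large dictionary, which is why the number of expanded terms and the matching strategy matter for performance.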
  • 27. Inverted Index
  • 28. Lucene – Levenshtein distance
    • FuzzyQuery uses the Levenshtein distance :
    • the number of single-character edits (insertions, deletions, substitutions) required to turn one word into another
  • 29. Lucene - FuzzyQuery
    • Current implementation is not optimal
    • LUCENE-2089 will use a Levenshtein automaton
  • 30. Lucene – Highlighter
    • Produces ready-to-use HTML snippets with the query words highlighted
    • Can be fully customized
    • By default, only the first 50 KB of characters are analyzed
    • Use FastVectorHighlighter for faster results (~2.5 times faster)
  • 31. Lucene – FieldCache
    • Lucene cache that keeps the values of a single field in memory
    • Used internally by Sort objects
    • Can be used to manually load the values of a single field :
    float[] weights = FieldCache.DEFAULT.getFloats(reader, "weight");
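    The benefit for sorting comes from the layout of the cached values: one entry per document id, so a value lookup is a plain array access. The sketch below illustrates the idea without any Lucene class. A hand-written float array stands in for the array returned by the cache, and the names are hypothetical.

```java
import java.util.Arrays;

// Sketch: sorting hits by a per-document value held in an array indexed by document id.
// The array plays the role of the cached field values; no Lucene class is used here.
final class FieldCacheSortSketch {

    // Returns the hit document ids ordered by descending weight.
    static Integer[] sortByWeight(int[] hitDocIds, float[] weights) {
        Integer[] sorted = new Integer[hitDocIds.length];
        for (int i = 0; i < hitDocIds.length; i++) {
            sorted[i] = hitDocIds[i];
        }
        // Each comparison is two array reads, with no access to stored documents.
        Arrays.sort(sorted, (a, b) -> Float.compare(weights[b], weights[a]));
        return sorted;
    }

    public static void main(String[] args) {
        float[] weights = {0.2f, 0.9f, 0.5f, 0.1f}; // hypothetical values, index = document id
        int[] hits = {0, 2, 3};                     // hypothetical matching document ids
        System.out.println(Arrays.toString(sortByWeight(hits, weights))); // prints [2, 0, 3]
    }
}
```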
  • 32. Lucene – MoreLikeThis
    • Finds similar documents
    • Produces a query to be searched
    MoreLikeThis mlt = new MoreLikeThis(reader);
    // Fields whose terms are compared
    mlt.setFieldNames(new String[] {"title", "author"});
    // Keep terms even if they occur only once in the document or in the index
    mlt.setMinTermFreq(1);
    mlt.setMinDocFreq(1);
    // Build a query from the terms of the document with id docId
    Query query = mlt.like(docId);
    indexSearcher.search(query, 10);
  • 33. Lucene – FunctionQueries
    • Allow score customization
    • Consider using FieldCache to reduce fetching cost
    // Reads the value of the "score" field as a byte for each document
    FieldScoreQuery scoreQuery = new FieldScoreQuery("score", FieldScoreQuery.Type.BYTE);
    // Combines the score of the main query q with the field value
    CustomScoreQuery customQ = new CustomScoreQuery(q, scoreQuery) {
        public float customScore(int doc, float subQueryScore, float valSrcScore) {
            return (float) (Math.sqrt(subQueryScore) * valSrcScore);
        }
    };
  • 34. Lucene – Luke
  • 35. Lucene – Global performance tuning
    • Consider using SSDs for low latency
    • Consider using RAMDirectory / InstantiatedIndex
    • Use the latest version of Lucene
    • Use NIOFSDirectory on Unix and MMapDirectory on Windows
    • Try turning off the compound file format (setUseCompoundFile)
  • 36. Lucene – Indexing performance tuning
    • Set RAMBufferSizeMB according to your needs
    • Tune your merge policy with care
  • 37. Lucene – Search performance tuning
    • Open the IndexReader in read-only mode (default in Lucene 2.9+)
    • Warm up the FieldCache to ensure immediate access when sorting
    • Limit the use of TermVector
    • Ensure the index is optimized
  • 38. Architecture with Hibernate Search
  • 39. Architecture with Solr
  • 40. Architecture with Infinispan
  • 41. Lucene – Distributed : Katta
    • Shards a Lucene index and distributes it over several instances
    • Uses Hadoop for distribution
  • 42. Lucene galaxy
    • Apache Nutch : Lucene + crawling and parsing
    • Compass : search engine framework
    • Apache Solr : Lucene standalone search server
    • Apache Mahout : distributed machine learning
    • Hibernate Search : Hibernate + Lucene
    • Katta : distributed Lucene with Hadoop
  • 43. Lucene - Futures
    • Flex branch : making Lucene even more customizable
    • Apache Mahout : distributed machine learning for clustering, classification and recommendation algorithms
  • 44. Questions ?