Xebia Knowledge Exchange (March 2010) - Lucene: From theory to real world
Published in: Technology

Transcript

  • 1. Lucene: from theory to real world
    Word cloud: Information retrieval, Indexing, Cluster, Apache, Performance tuning, Parser, Dictionary, IndexReader, Solr, Real world, Java, Analysis, Troubleshooting, Vector, Relevance, Query, Server, Design, Fields, Document, Probabilistic, Production, Model, Search application, Open Source, Inverted index, Doug Cutting, Library, Architecture
  • 2. Agenda
    Introduction to Information Retrieval
    Lucene overview
    Lucene in detail
    Search applications design
    Performance tuning
    www.xebia.fr / blog.xebia.fr
  • 3. Information Retrieval
  • 4. Information Retrieval
    “Information Retrieval (IR) is the science of searching for documents”
  • 5. (no text on this slide)
  • 6. Inverted Index
  • 7. Boolean Model
    • Query and documents are conceived as sets of terms
    Q = (T1 OR T2) AND (T3 OR T4)
    D1 = {T1, T3}
    D2 = {T2, T3, T4}
    • The result set of a query is a composition of unions and intersections
    R = {D1, D2}
    with union for the OR operator
    and intersection for the AND operator
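The Boolean evaluation on this slide can be sketched in plain Java, without Lucene. The class and method names below are illustrative assumptions, not Lucene APIs: a toy inverted index maps each term to the set of documents that contain it, OR becomes a set union and AND a set intersection.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: a toy inverted index, not Lucene's implementation.
class BooleanModelSketch {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, Set<String>> postings = new HashMap<>();

    void add(String docId, List<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Returns a copy so callers can combine sets without touching the index.
    Set<String> docsFor(String term) {
        Set<String> docs = postings.get(term);
        return docs == null ? new TreeSet<String>() : new TreeSet<String>(docs);
    }

    // OR operator: union of the two document sets.
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<String>(a);
        result.addAll(b);
        return result;
    }

    // AND operator: intersection of the two document sets.
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<String>(a);
        result.retainAll(b);
        return result;
    }

    // Evaluates Q = (T1 OR T2) AND (T3 OR T4) over D1 and D2 from the slide.
    static Set<String> slideExample() {
        BooleanModelSketch index = new BooleanModelSketch();
        index.add("D1", Arrays.asList("T1", "T3"));
        index.add("D2", Arrays.asList("T2", "T3", "T4"));
        return and(or(index.docsFor("T1"), index.docsFor("T2")),
                   or(index.docsFor("T3"), index.docsFor("T4")));
    }
}
```

Calling `BooleanModelSketch.slideExample()` yields the set {D1, D2}, which matches R on the slide: both documents contain T1 or T2, and both contain T3 or T4.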
  • 8. Vector Space Model
    • Documents and queries are represented as vectors
    dj = (w1,j, w2,j, ..., wt,j)
    q = (w1,q, w2,q, ..., wt,q)
    • Similarity can be computed with: (formula image not transcribed)
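The similarity formula itself did not survive in this transcript. A common choice in the vector space model is the cosine of the angle between the two vectors, sim(dj, q) = (dj · q) / (|dj| |q|). The sketch below assumes that measure; the class name and the zero-vector convention are illustrative choices, not something stated on the slide.

```java
// Illustrative sketch: cosine similarity between two term-weight vectors.
// Assumption: both vectors are indexed by the same t terms.
class CosineSimilaritySketch {
    static double cosine(double[] d, double[] q) {
        if (d.length != q.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normD = 0.0;
        double normQ = 0.0;
        for (int i = 0; i < d.length; i++) {
            dot += d[i] * q[i];
            normD += d[i] * d[i];
            normQ += q[i] * q[i];
        }
        if (normD == 0.0 || normQ == 0.0) {
            return 0.0; // convention chosen here for an all-zero vector
        }
        return dot / (Math.sqrt(normD) * Math.sqrt(normQ));
    }
}
```

With non-negative weights the result lies between 0 (no term in common) and 1 (same direction), so documents can be ranked by decreasing similarity to the query.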
  • 9. Lucene
  • 10. Lucene: where do we come from?
  • 11. Lucene documentation
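  • 12. Lucene: simple indexing example
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    Directory directory = new RAMDirectory(); // in-memory index
    IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("company", "Xebia", Field.Store.YES, Field.Index.NOT_ANALYZED)); // stored, indexed as a single term
    doc.add(new Field("country", "France", Field.Store.YES, Field.Index.NO)); // stored only, not searchable
    writer.addDocument(doc);
    writer.close();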
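  • 13. Lucene: simple search example
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    // "country" is stored but not indexed (Field.Index.NO), so search the indexed "company" field
    IndexSearcher searcher = new IndexSearcher(directory, true); // true = read-only
    Term t = new Term("company", "Xebia");
    Query query = new TermQuery(t);
    TopDocs docs = searcher.search(query, 10);
    assertEquals(1, docs.totalHits);
    searcher.close();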
  • 14. Lucene - indexing
  • 15. Lucene - analyzers
  • 16. Lucene – Field types
    • Store: YES / NO
    • Index: NO / ANALYZED / NOT_ANALYZED / ANALYZED_NO_NORMS / NOT_ANALYZED_NO_NORMS
    • TermVector: NO / WITH_POSITIONS / WITH_OFFSETS / WITH_POSITIONS_OFFSETS / YES
  • 17. Lucene storage - segments
  • 18. Lucene storage - segments
    • A new segment is created each time IndexWriter is flushed
    • When documents are deleted, a marker is added in the current segment
  • 19. Lucene storage – segments merge
    • Segments are merged manually with IndexWriter.optimize()
    • Or automatically merged depending on:
    (int) log(max(minMergeMB, size))/log(mergeFactor)
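As a worked example of the formula above, here is a small sketch that computes the value for a few segment sizes. It assumes that the cast applies to the whole quotient, that `size` and `minMergeMB` are both expressed in MB, and that the result is a "level" used to group segments of comparable size. The class name and the sample values are illustrative; this is not Lucene's merge policy code.

```java
// Illustrative sketch of the slide's formula; not Lucene's merge policy code.
class MergeLevelSketch {
    // Assumption: sizeMB and minMergeMB use the same unit, and the (int) cast
    // truncates the whole quotient log(max(minMergeMB, size)) / log(mergeFactor).
    static int level(double sizeMB, double minMergeMB, int mergeFactor) {
        double effectiveSize = Math.max(minMergeMB, sizeMB);
        return (int) (Math.log(effectiveSize) / Math.log(mergeFactor));
    }
}
```

With `minMergeMB = 1.6` and `mergeFactor = 10` as sample values, segments of 0.5 MB and 5 MB both land on level 0, a 50 MB segment on level 1 and a 500 MB segment on level 2. Under this reading, a larger `mergeFactor` widens each level, so more segments accumulate before a merge is triggered.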
  • 20. Lucene - search
  • 21. Lucene - search
  • 22. Lucene - QueryParser
    • QueryParser builds a Query object from a user query string
    +JUNIT +ANT -MOCK
    +xebya~0.8
    +title:"Junit in action"
    • Most of the time, it won’t fit application requirements
  • 23. Lucene – contrib/QueryParser
    • Framework that simplifies the creation of a query parser that fits your needs
    • 3 layers:
      • QueryParser: transforms a query string into an abstract syntax tree representation
      • QueryNodeProcessor: processes nodes of the tree to move, remove or modify them
      • QueryBuilder: builds a Lucene BooleanQuery tree from the abstract syntax tree
  • 24. Lucene – boolean queries
  • 25. Lucene – PhraseQuery & SpanQuery
    • SpanQuery: matches documents that contain terms separated by at most n other terms (n is the ‘slop’)
    • PhraseQuery: like a SpanQuery with a slop value of 0 (the default slop)
    • Uses position information
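To illustrate how position information makes slop matching possible, here is a minimal sketch in plain Java. It assumes ordered matching (the second term must come after the first) and counts the terms in between, following the slide's definition of slop. The class and method names are hypothetical, and Lucene's own slop computation is more general than this.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of position-based matching; not Lucene's SpanQuery code.
class SlopMatchSketch {
    // Positions at which the term occurs in the token stream.
    static List<Integer> positions(String[] tokens, String term) {
        List<Integer> result = new ArrayList<Integer>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                result.add(i);
            }
        }
        return result;
    }

    // True if `second` occurs after `first` with at most `slop` terms in between.
    static boolean matches(String[] tokens, String first, String second, int slop) {
        for (int p1 : positions(tokens, first)) {
            for (int p2 : positions(tokens, second)) {
                int between = p2 - p1 - 1;
                if (between >= 0 && between <= slop) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

For the tokens `junit in action`, the pair (`junit`, `action`) does not match with a slop of 0 because one term sits in between, but it matches with a slop of 1. The pair (`junit`, `in`) matches with a slop of 0, which is the exact-phrase case.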
  • 26. Lucene storage – approximate queries
    • Approximate queries (Prefix, Regex, Wildcard, Fuzzy) get transformed to a set of TermQueries
    Dictionary = { court, cours, courir }
    FuzzyQuery = cour
    TransformedQuery = court OR cours
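The same rewriting idea can be sketched for the simplest case, a prefix query, in plain Java. The sketch assumes the term dictionary is kept sorted (modelled here with a `TreeSet`), so all terms sharing a prefix sit next to each other. The class and method names are illustrative and do not correspond to Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative sketch: rewriting a prefix query into an OR of exact terms.
class PrefixExpansionSketch {
    static List<String> expand(TreeSet<String> dictionary, String prefix) {
        List<String> terms = new ArrayList<String>();
        // The dictionary is sorted, so matching terms are contiguous,
        // starting at the first term that is >= prefix.
        for (String term : dictionary.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // past the last term carrying the prefix
            }
            terms.add(term);
        }
        return terms;
    }

    // Renders the expanded terms as a readable boolean query string.
    static String rewrite(TreeSet<String> dictionary, String prefix) {
        return String.join(" OR ", expand(dictionary, prefix));
    }
}
```

Against the slide's dictionary, the prefix `cour` expands to `courir OR cours OR court`. A fuzzy query is stricter and keeps only the terms that are close enough to `cour`, which is why the slide's transformed query leaves out `courir`. Either way, the cost of the query grows with the number of dictionary terms it expands to.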
  • 27. Inverted Index
  • 28. Lucene – Levenshtein distance
    • FuzzyQuery uses Levenshtein distance:
      • the minimum number of single-character insertions, deletions or substitutions required to turn one word into another
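A minimal sketch of the distance itself, using the classic dynamic-programming formulation with two rows. This is a textbook version written for illustration, with a hypothetical class name. It is not the code FuzzyQuery runs internally.

```java
// Illustrative sketch: classic dynamic-programming Levenshtein distance.
class LevenshteinSketch {
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int substitution = previous[j - 1] + cost;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }
}
```

With the dictionary from the approximate-queries slide, `cour` is at distance 1 from `court` and from `cours`, and at distance 2 from `courir`. That is consistent with the slide's expansion to `court OR cours`, assuming the similarity threshold admits only one edit for a word of this length.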
  • 29. Lucene - FuzzyQuery
    • Current implementation is not optimal
    • LUCENE-2089 will use a Levenshtein automaton
  • 30. Lucene – Highlighter
    • Produces ready-to-use HTML snippets with the query words highlighted
    • Can be fully customized
    • By default, limited to the first 50 KB of characters
    • Use FastVectorHighlighter for faster results (~2.5 times faster)
  • 31. Lucene – FieldCache
    • Lucene cache that keeps the values of a single field in memory
    • Used internally by Sort objects
    • Can be used to manually load the values of a single field:
    float[] weights = FieldCache.DEFAULT.getFloats(reader, "weight");
  • 32. Lucene – MoreLikeThis
    • Finds similar documents
    • Produces a query to be searched
    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(new String[] {"title", "author"});
    mlt.setMinTermFreq(1); // keep terms occurring at least once in the source document
    mlt.setMinDocFreq(1); // keep terms present in at least one document of the index
    Query query = mlt.like(docId);
    indexSearcher.search(query, 10);
  • 33. Lucene – FunctionQueries
    • Allows score customization
    • Consider using FieldCaches to reduce fetching cost
    FieldScoreQuery scoreQuery = new FieldScoreQuery("score", FieldScoreQuery.Type.BYTE);
    // q is the main query whose score is being customized
    CustomScoreQuery customQ = new CustomScoreQuery(q, scoreQuery) {
        public float customScore(int doc, float subQueryScore, float valSrcScore) {
            return (float) (Math.sqrt(subQueryScore) * valSrcScore);
        }
    };
  • 34. Lucene – Luke
  • 35. Lucene – Global performance tuning
    • Consider using SSDs for low latency
    • Consider using RAMDirectory / InstantiatedIndex
    • Use the latest version of Lucene
    • Use NIOFSDirectory on Unix and MMapDirectory on Windows
    • Try turning off the compound file format (setUseCompoundFile(false))
  • 36. Lucene – Indexing performance tuning
    • Set RAMBufferSizeMB according to your needs
    • Tune your merge policy with care
  • 37. Lucene – Search performance tuning
    • Open IndexReader in read-only mode (default in Lucene 2.9+)
    • Warm up the FieldCache to ensure immediate access when sorting
    • Limit the use of TermVector
    • Ensure the index is optimized
  • 38. Architecture with Hibernate Search
  • 39. Architecture with Solr
  • 40. Architecture with Infinispan
  • 41. Lucene – Distributed: Katta
    • Shards and distributes Lucene indexes over instances
    • Uses Hadoop for distribution
  • 42. Lucene galaxy
    • Apache Nutch: Lucene + crawling and parsing
    • Compass: search engine framework
    • Apache Solr: Lucene standalone search server
    • Apache Mahout: distributed machine learning
    • Hibernate Search: Hibernate + Lucene
    • Katta: distributed Lucene with Hadoop
  • 43. Lucene - Futures
    • Flex branch: making Lucene even more customizable
    • Apache Mahout: distributed machine learning for clustering, classification and recommendation algorithms
  • 44. Questions?
