Xebia Knowledge Exchange (March 2010) - Lucene : From theory to real world
 


    • Lucene from theory to real world
      Information retrieval
      Indexing
      Cluster
      Apache
      Performance tuning
      Parser
      Dictionary
      IndexReader
      Solr
      Real world
      Java
      Analysis
      Troubleshooting
      Vector
      Relevance
      Query
      Server
      Design
      Fields
      Document
      Probabilistic
      Production
      Model
      Search application
      Open Source
      Inverted index
      Doug Cutting
      Library
      Architecture
    • Agenda
      Introduction to Information Retrieval
      Lucene overview
      Lucene in detail
      Search applications design
      Performance tuning
      www.xebia.fr / blog.xebia.fr
    • Information Retrieval
    • Information Retrieval
      “Information Retrieval (IR) is the science of searching for documents”
    • Inverted Index
    • Boolean Model
      • Queries and documents are represented as sets of terms
      Q = (T1 OR T2) AND (T3 OR T4)
      D1 = {T1, T3}
      D2 = {T2, T3, T4}
      • The result set of a query is a composition of unions and intersections
      R = {D1, D2}
      with union for the OR operator and intersection for the AND operator
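
The Boolean evaluation above can be sketched in plain Java, without Lucene. This is only an illustration under stated assumptions: the class and method names (BooleanModelSketch, index, or, and, docsFor) are hypothetical, and the inverted index is a simple in-memory map from each term to the ids of the documents that contain it.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class BooleanModelSketch {

    // Inverted index: term -> sorted set of ids of the documents containing it.
    static Map<String, Set<String>> index(Map<String, List<String>> docs) {
        Map<String, Set<String>> postings = new HashMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            for (String term : doc.getValue()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return postings;
    }

    // OR is a set union.
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // AND is a set intersection.
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Postings of a term, or an empty set when no document contains it.
    static Set<String> docsFor(Map<String, Set<String>> postings, String term) {
        return postings.getOrDefault(term, new TreeSet<>());
    }

    // Evaluates Q = (T1 OR T2) AND (T3 OR T4) on the documents D1 and D2 of the slide.
    static Set<String> slideExample() {
        Map<String, List<String>> docs = new HashMap<>();
        docs.put("D1", Arrays.asList("T1", "T3"));
        docs.put("D2", Arrays.asList("T2", "T3", "T4"));
        Map<String, Set<String>> p = index(docs);
        return and(or(docsFor(p, "T1"), docsFor(p, "T2")),
                   or(docsFor(p, "T3"), docsFor(p, "T4")));
    }

    public static void main(String[] args) {
        System.out.println(slideExample()); // prints [D1, D2]
    }
}
```

Both documents match because each contains at least one term from each OR group. The Boolean model only says whether a document matches; it does not rank the results.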
    • Vector Space Model
      • Documents and queries are represented as vectors
      dj = (w1,j, w2,j, ..., wt,j)
      q = (w1,q, w2,q, ..., wt,q)
      • Similarity can be computed from these vectors, typically as the cosine of the angle between them
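
The similarity formula itself is not part of this transcript. A common choice in the vector space model is cosine similarity, so the sketch below assumes that measure. The class name CosineSimilaritySketch and the sample weights are hypothetical; the arrays stand for the weight vectors dj and q.

```java
final class CosineSimilaritySketch {

    // Cosine of the angle between two term-weight vectors of equal length.
    static double cosine(double[] d, double[] q) {
        if (d.length != q.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double dNorm = 0.0;
        double qNorm = 0.0;
        for (int i = 0; i < d.length; i++) {
            dot += d[i] * q[i];
            dNorm += d[i] * d[i];
            qNorm += q[i] * q[i];
        }
        if (dNorm == 0.0 || qNorm == 0.0) {
            return 0.0; // an all-zero vector is treated as similar to nothing
        }
        return dot / (Math.sqrt(dNorm) * Math.sqrt(qNorm));
    }

    public static void main(String[] args) {
        double[] q = {1, 1, 0};
        // Same direction as the query: similarity close to 1.
        System.out.println(cosine(new double[] {2, 2, 0}, q));
        // No term in common with the query: similarity 0.
        System.out.println(cosine(new double[] {0, 0, 3}, q));
    }
}
```

Unlike the Boolean model, this gives a graded score, which is what makes ranking by relevance possible.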
    • Lucene
    • Lucene : where do we come from ?
    • Lucene documentation
    • Lucene : Simple indexing example
      // In-memory index; WhitespaceAnalyzer splits text on whitespace only
      Directory directory = new RAMDirectory();
      IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
      Document doc = new Document();
      // "company" is stored and indexed as a single token
      doc.add(new Field("company", "Xebia", Field.Store.YES, Field.Index.NOT_ANALYZED));
      // "country" is stored but not indexed, so it cannot be searched
      doc.add(new Field("country", "France", Field.Store.YES, Field.Index.NO));
      writer.addDocument(doc);
      writer.close();
    • Lucene : Simple search example
      // Open the index in read-only mode
      IndexSearcher searcher = new IndexSearcher(directory, true);
      // Query an indexed field: "country" was added with Field.Index.NO and is not searchable
      Term t = new Term("company", "Xebia");
      Query query = new TermQuery(t);
      TopDocs docs = searcher.search(query, 10);
      assertEquals(1, docs.totalHits);
      searcher.close();
    • Lucene - indexing
    • Lucene - analyzers
    • Lucene – Field types
      • Store : YES / NO
      • Index : NO / ANALYZED / NOT_ANALYZED / ANALYZED_NO_NORMS / NOT_ANALYZED_NO_NORMS
      • TermVector : NO / WITH_POSITIONS / WITH_OFFSETS / WITH_POSITIONS_OFFSETS / YES
    • Lucene storage - segments
    • Lucene storage - segments
      • A new segment is created each time the IndexWriter is flushed
      • When documents are deleted, a marker is added in the current segment
    • Lucene storage – segments merge
      • Segments are merged manually with IndexWriter.optimize()
      • Or merged automatically, depending on :
      (int) (log(max(minMergeMB, size)) / log(mergeFactor))
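
The formula above assigns each segment a level. As a rough sketch of how a log-based merge policy can use it: segments with the same level have comparable sizes, and a merge is triggered once enough of them accumulate. The class name MergeLevelSketch is hypothetical, and the values 1.6 and 10 for minMergeMB and mergeFactor are assumptions chosen for the example.

```java
final class MergeLevelSketch {

    // Level of a segment, following the formula on the slide.
    // Sizes are in MB; every segment smaller than minMergeMB lands in the lowest level.
    static int level(double sizeMB, double minMergeMB, int mergeFactor) {
        return (int) (Math.log(Math.max(minMergeMB, sizeMB)) / Math.log(mergeFactor));
    }

    public static void main(String[] args) {
        double minMergeMB = 1.6; // assumed value for the example
        int mergeFactor = 10;    // assumed value for the example
        for (double size : new double[] {0.5, 5, 50, 500}) {
            System.out.println(size + " MB -> level " + level(size, minMergeMB, mergeFactor));
        }
    }
}
```

With these values, the 0.5 MB and 5 MB segments share level 0, the 50 MB segment is at level 1 and the 500 MB segment at level 2. A larger mergeFactor widens each level, which means fewer merges while indexing but more segments to search.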
    • Lucene - search
    • Lucene - search
      • Programmatic API
      • TermQuery
      • PhraseQuery
      • WildcardQuery
      • RangeQuery
      • FuzzyQuery
      • BooleanQuery
    • Lucene - QueryParser
      • QueryParser builds a Query object from a user query string
      +JUNIT +ANT -MOCK
      +xebya~0.8
      +title:"Junit in action"
      • Most of the time, it won't fit application requirements
    • Lucene – contrib/QueryParser
      • Framework that simplifies the creation of a query parser that fits your needs
      • 3 layers :
      • QueryParser : transforms a query string into an abstract syntax tree representation
      • QueryNodeProcessor : processes nodes of the tree to move, remove or modify them
      • QueryBuilder : builds a Lucene BooleanQuery tree from the abstract syntax tree
    • Lucene – boolean queries
    • Lucene – PhraseQuery & SpanQuery
      • SpanQuery : matches documents that contain terms separated by at most n other terms (n is the ‘slop’)
      • PhraseQuery : SpanQuery with a slop value of 0
      • Uses position information
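
To make the role of position information concrete, here is a minimal plain-Java sketch of slop matching for two terms. It is a simplification and not Lucene's implementation: it only handles two terms in order, and the names SlopMatchSketch, positions and matches are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SlopMatchSketch {

    // Records, for each term, the positions at which it occurs in the text.
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> result = new HashMap<>();
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            result.computeIfAbsent(tokens[i], t -> new ArrayList<>()).add(i);
        }
        return result;
    }

    // True when 'second' occurs after 'first' with at most 'slop' other terms in between.
    static boolean matches(Map<String, List<Integer>> pos, String first, String second, int slop) {
        List<Integer> firstPos = pos.getOrDefault(first, new ArrayList<>());
        List<Integer> secondPos = pos.getOrDefault(second, new ArrayList<>());
        for (int p1 : firstPos) {
            for (int p2 : secondPos) {
                int between = p2 - p1 - 1; // number of terms separating the two occurrences
                if (between >= 0 && between <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> pos = positions("Junit in action");
        System.out.println(matches(pos, "junit", "in", 0));     // adjacent terms: true
        System.out.println(matches(pos, "junit", "action", 0)); // one term in between: false
        System.out.println(matches(pos, "junit", "action", 1)); // allowed by a slop of 1: true
    }
}
```

Without stored positions, the index only knows that both terms occur in the document, so neither check would be possible.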
    • Lucene storage – approximate queries
      • Approximate queries (Prefix, Regex, Wildcard, Fuzzy) get transformed into a set of TermQueries
      Dictionary = { court, cours, courir }
      FuzzyQuery = cour
      TransformedQuery = court OR cours
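
The same rewriting idea can be illustrated for a prefix query, the simplest of the approximate queries listed above. The sketch assumes the term dictionary is kept sorted, so that all terms sharing a prefix are contiguous. The class PrefixExpansionSketch is hypothetical, and the terms "chat" and "cuisine" are added to the slide's dictionary only to show terms that do not match.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

final class PrefixExpansionSketch {

    // Returns the dictionary terms starting with 'prefix', i.e. the terms that
    // would be OR-ed together in the rewritten query.
    static SortedSet<String> expand(SortedSet<String> dictionary, String prefix) {
        SortedSet<String> matches = new TreeSet<>();
        // tailSet skips to the first term >= prefix; matching terms follow contiguously.
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the last term sharing the prefix
            }
            matches.add(term);
        }
        return matches;
    }

    // Renders the expanded terms as a disjunction, as on the slide.
    static String toQueryString(SortedSet<String> terms) {
        return String.join(" OR ", terms);
    }

    public static void main(String[] args) {
        SortedSet<String> dictionary =
                new TreeSet<>(Arrays.asList("chat", "court", "cours", "courir", "cuisine"));
        System.out.println(toQueryString(expand(dictionary, "cour")));
        // prints courir OR cours OR court
    }
}
```

The cost of such a query grows with the number of matching terms, which is why very short prefixes or leading wildcards can be expensive.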
    • Inverted Index
    • Lucene – Levenshtein distance
      • FuzzyQuery uses Levenshtein distance :
      • the number of single-character edits (insertions, deletions, substitutions) required to turn one word into another
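
A minimal sketch of the distance computation, applied to the dictionary { court, cours, courir } and the fuzzy term cour from the approximate queries slide. It uses the textbook dynamic-programming algorithm and a plain maximum-distance threshold; both are assumptions for illustration, and the names FuzzyExpansionSketch, levenshtein and expand are hypothetical rather than Lucene's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FuzzyExpansionSketch {

    // Textbook dynamic-programming Levenshtein distance, keeping only two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j characters of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // erasing the first i characters of a takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Keeps the dictionary terms within maxDistance edits of the query term.
    static List<String> expand(List<String> dictionary, String term, int maxDistance) {
        List<String> matches = new ArrayList<>();
        for (String candidate : dictionary) {
            if (levenshtein(term, candidate) <= maxDistance) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("court", "cours", "courir");
        System.out.println(String.join(" OR ", expand(dictionary, "cour", 1)));
        // prints court OR cours
    }
}
```

Here "court" and "cours" are one edit away from "cour", while "courir" needs two and is dropped. Note that this sketch compares the query term against every dictionary entry, so its cost grows with the size of the dictionary.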
    • Lucene - FuzzyQuery
      • The current implementation is not optimal
      • LUCENE-2089 will use a Levenshtein automaton
    • Lucene – Highlighter
      • Produces ready-to-use HTML snippets with the query words highlighted
      • Can be fully customized
      • By default, limited to the first 50 KB of text
      • Use FastVectorHighlighter for faster results (~2.5 times faster)
    • Lucene – FieldCache
      • Lucene cache that keeps the values of a single field in memory
      • Used internally by Sort objects
      • Can be used to manually load the values of a single field :
      float[] weights = FieldCache.DEFAULT.getFloats(reader, "weight");
    • Lucene – MoreLikeThis
      • Finds similar documents
      • Produces a query to be searched
      MoreLikeThis mlt = new MoreLikeThis(reader);
      mlt.setFieldNames(new String[] {"title", "author"});
      mlt.setMinTermFreq(1);
      mlt.setMinDocFreq(1);
      Query query = mlt.like(docId);
      indexSearcher.search(query, 10);
    • Lucene – FunctionQueries
      • Allow score customization
      • Consider using FieldCache to reduce fetching cost
      FieldScoreQuery scoreQuery = new FieldScoreQuery("score", FieldScoreQuery.Type.BYTE);
      CustomScoreQuery customQ = new CustomScoreQuery(q, scoreQuery) {
        public float customScore(int doc, float subQueryScore, float valSrcScore) {
          // Dampen the text relevance score, then weight it by the value of the "score" field
          return (float) (Math.sqrt(subQueryScore) * valSrcScore);
        }
      };
    • Lucene – Luke
    • Lucene – Global performance tuning
      • Consider using SSDs for low latency
      • Consider using RAMDirectory / InstantiatedIndex
      • Use the latest version of Lucene
      • Use NIOFSDirectory on Unix and MMapDirectory on Windows
      • Try turning off the compound file format (setUseCompoundFile)
    • Lucene – Indexing performance tuning
      • Set RAMBufferSizeMB according to your needs
      • Tune your merge policy with care
    • Lucene – Search performance tuning
      • Open the IndexReader in read-only mode (default in Lucene 2.9+)
      • Warm up the FieldCache to ensure immediate access when sorting
      • Limit the use of TermVector
      • Ensure the index is optimized
    • Architecture with Hibernate Search
    • Architecture with Solr
    • Architecture with Infinispan
    • Lucene – Distributed : Katta
      • Shards a Lucene index and distributes it over several instances
      • Uses Hadoop for distribution
    • Lucene galaxy
      • Apache Nutch : Lucene + crawling and parsing
      • Compass : search engine framework
      • Apache Solr : Lucene standalone search server
      • Apache Mahout : distributed machine learning
      • Hibernate Search : Hibernate + Lucene
      • Katta : distributed Lucene with Hadoop
    • Lucene - Futures
      • Flex branch : making Lucene even more customizable
      • Apache Mahout : distributed machine learning for clustering, classification and recommendation algorithms
    • Questions ?