Judcon Brazil 2014 Lucene from the bottom up

Judcon Brazil 2014

Published in: Technology

  1. 1. Lucene from the bottom up, by Gustavo Fernandes

  2. 2. What is Lucene
      An ultra-fast, high-throughput, Apache-licensed search library with a low memory footprint and support for incremental indexing, written in Java, with ports to several languages (Python, .NET, C++)
  3. 3. What Lucene is not
 • Service
 • Database
 • Product
  4. 4. Search
  5. 5. Search
      Doc 1 (GTAV for PS3): You start by creating and developing a unique character and invest in your potential criminal by customising his or her appearance
      Doc 2 (DC Universe Online for PS4): Battle with or against their favourite heroes and outlaws, or your own customised character
      Doc 3 (Assassins Creed Black Flag for PS3): Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway
      Doc 4 (The Last of Us for PS4): Joel, a brutal survivor, and Ellie, a brave young teenage girl must work together if they hope to survive their journey across the US.
  6. 6. Index
      Doc 1: a and appearance by character creating criminal customising developing gta v her his in invest or potential ps3 start unique you your
      Doc 2: against and battle character customised dc favourite heroes online or outlaws own ps4 their universe with your
      Doc 3: a assassins among and captain caribbean creed developed edward fearsome have is kenway lawless named outlaws pirate pirates ps3 republic rule the these young
      Doc 4: a across and brave brutal ellie girl hope if joel journey last must of ps4 survive survivor teenage the their they to together us work young
  7. 7. Inverted Index (term: ids of the documents that contain it)
      across: 4; against: 2; among: 3; appearance: 1; battle: 2; brave: 4; brutal: 4
      captain: 3; caribbean: 3; character: 1, 2; creating: 1; criminal: 1; customised: 2; customising: 1
      developed: 3; developing: 1; edward: 3; ellie: 4; favourite: 2; fearsome: 3; girl: 4
      heroes: 2; hope: 4; invest: 1; joel: 4; journey: 4; kenway: 3; lawless: 3
      must: 4; named: 3; outlaws: 2, 3; own: 2; pirate: 3; pirates: 3; potential: 1
      republic: 3; rule: 3; start: 1; survive: 4; survivor: 4; teenage: 4; together: 4
      unique: 1; work: 4; young: 3, 4
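The structure on this slide can be sketched in a few lines of plain Java. This is an illustration only, not Lucene's implementation: the class name `InvertedIndexSketch` and the regular-expression tokeniser are assumptions made for the example, and unlike the slide it keeps stop words such as "a" and "and".

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class, not part of Lucene.
final class InvertedIndexSketch {

    // The four descriptions from the slides, keyed by document id.
    static Map<Integer, String> sampleDocs() {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "You start by creating and developing a unique character and invest in "
                + "your potential criminal by customising his or her appearance");
        docs.put(2, "Battle with or against their favourite heroes and outlaws, "
                + "or your own customised character");
        docs.put(3, "Pirates rule the Caribbean and have developed a lawless pirate republic. "
                + "Among these outlaws is a fearsome young captain named Edward Kenway");
        docs.put(4, "Joel, a brutal survivor, and Ellie, a brave young teenage girl must work "
                + "together if they hope to survive their journey across the US.");
        return docs;
    }

    // Maps each term to the sorted set of ids of the documents that contain it.
    static Map<String, SortedSet<Integer>> build(Map<Integer, String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            // Naive tokenisation: lowercase, then split on anything that is not a letter or digit.
            String[] tokens = doc.getValue().toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            for (String term : tokens) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> index = build(sampleDocs());
        for (String term : new String[] {"character", "outlaws", "young"}) {
            System.out.println(term + " -> " + index.get(term));
        }
    }
}
```

Looking up a term is then a single map access, which is why search cost depends on the number of matching documents rather than on the size of the collection.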
  8. 8. Documents and Fields
      Id: 1 | Console: PS3 | Title: GTAV | Description: You start by creating and developing a unique character and invest in your potential criminal by customising his or her appearance
      Id: 2 | Console: PS4 | Title: DC Universe Online | Description: Battle with or against their favourite heroes and outlaws, or your own customised character
      Id: 3 | Console: PS3 | Title: Assassins Creed | Description: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway
      Id: 4 | Console: PS4 | Title: The Last of Us | Description: Joel, a brutal survivor, and Ellie, a brave young teenage girl must work together if they hope to survive their journey across the US.
  9. 9. Fields (one inverted index per field; term: document ids)
      Field: Id. 1: 1; 2: 2; 3: 3; 4: 4
      Field: Console. ps3: 1, 3; ps4: 2, 4
      Field: Title. assassins: 3; black: 3; creed: 3; dc: 2; flag: 3; gta: 1; last: 4; of: 4; online: 2; universe: 2; the: 4; us: 4; v: 1
      Field: Description. across: 4; against: 2; among: 3; appearance: 1; battle: 2; brave: 4; brutal: 4; captain: 3; . . . republic: 3; rule: 3; start: 1; survive: 4; survivor: 4; teenage: 4; together: 4; unique: 1; work: 4; young: 3, 4
  10. 10. On Terms
 • Unit of search
 • Created by a process called tokenisation
 • Numerous ways of doing it
 • Language-specific “gotchas”
  11. 11. Examples
      Original: Joel, a brutal survivor, and Ellie, a brave young teenage girl must work together.
      Punctuation removed: Joel a brutal survivor and Ellie a brave young teenage girl must work together
      Stop words removed: Joel brutal survivor Ellie brave young teenage girl must work together
      Lowercased: joel brutal survivor ellie brave young teenage girl must work together
      Stemming and synonyms added: joel brutal survivor survive ellie brave fearless young teenage teen-age girl must work together
      (survive comes from stemming; fearless and teen-age are synonyms)
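The first steps of the example above (drop punctuation, lowercase, remove stop words) can be approximated with a short, Lucene-free sketch. The class name `AnalysisSketch` and its tiny stop word list are assumptions made for illustration; stemming and synonym expansion are left out, and the sketch lowercases before filtering stop words, which gives the same result for this sentence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, not part of Lucene.
final class AnalysisSketch {

    // Illustrative stop word list (an assumption); real analyzers ship larger,
    // language-specific sets.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "and", "the", "of"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // 1. Tokenise: split on anything that is not a letter or digit, which drops punctuation.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            // 2. Lowercase each token.
            String lower = token.toLowerCase(Locale.ROOT);
            // 3. Keep the token only if it is not a stop word.
            if (!STOP_WORDS.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze(
                "Joel, a brutal survivor, and Ellie, a brave young teenage girl must work together."));
    }
}
```

The same chain has to run at query time as well, otherwise a query for "Joel" would never meet the indexed term "joel".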
  12. 12. Examples (2)
      Original: Coca-Cola improved the market share of the flagship brand Diet Coke by 0.4% to 42.4%
      Terms: coca cola improved market share flagship brand diet coke 0 4 42 4
      Original: 私の名前はグスタボです (My name is Gustavo)
      Terms: 私 の 名 前 は グ ス タ ボ で す
  13. 13. Phrase
      q=title:"black creed"
      q=description:"young teenage"
      Postings with term positions, written as term: document (position)
      Field: Title. assassins: 3 (1); black: 3 (3); creed: 3 (2); dc: 2 (1); flag: 3 (4); gta: 1 (1); last: 4 (2); of: 4 (3); online: 2 (3); universe: 2 (2); the: 4 (1); us: 4 (4); v: 1 (2)
      Field: Description. republic: 3 (11); rule: 3 (2); start: 1 (2); survive: 4 (19); survivor: 4 (4); teenage: 4 (10); together: 4 (14); unique: 1 (8); work: 4 (13); young: 3 (18), 4 (9)
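Phrase matching over positions can be sketched as follows. This is a simplified illustration in plain Java, not how Lucene evaluates a PhraseQuery internally; the class `PhraseSketch` and its methods are hypothetical, each instance stands for the index of one field, and positions are 1-based with stop words counted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class, not part of Lucene. One instance indexes one field.
final class PhraseSketch {

    // term -> (document id -> 1-based positions of the term in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    void add(int docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            position++;
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position);
        }
    }

    // Ids of the documents in which the terms occur at consecutive positions, in order.
    SortedSet<Integer> phrase(String... terms) {
        SortedSet<Integer> hits = new TreeSet<>();
        Map<Integer, List<Integer>> first = postings.getOrDefault(terms[0], Collections.emptyMap());
        for (Map.Entry<Integer, List<Integer>> entry : first.entrySet()) {
            int docId = entry.getKey();
            for (int start : entry.getValue()) {
                boolean match = true;
                for (int i = 1; i < terms.length && match; i++) {
                    List<Integer> positions =
                            postings.getOrDefault(terms[i], Collections.emptyMap()).get(docId);
                    match = positions != null && positions.contains(start + i);
                }
                if (match) {
                    hits.add(docId);
                }
            }
        }
        return hits;
    }

    static PhraseSketch sampleTitles() {
        PhraseSketch index = new PhraseSketch();
        index.add(1, "GTA V");
        index.add(2, "DC Universe Online");
        index.add(3, "Assassins Creed Black Flag");
        index.add(4, "The Last of Us");
        return index;
    }

    static PhraseSketch sampleDescriptions() {
        PhraseSketch index = new PhraseSketch();
        index.add(3, "Pirates rule the Caribbean and have developed a lawless pirate republic. "
                + "Among these outlaws is a fearsome young captain named Edward Kenway");
        index.add(4, "Joel, a brutal survivor, and Ellie, a brave young teenage girl must work "
                + "together if they hope to survive their journey across the US.");
        return index;
    }

    public static void main(String[] args) {
        System.out.println("title:\"black creed\" -> " + sampleTitles().phrase("black", "creed"));
        System.out.println("description:\"young teenage\" -> "
                + sampleDescriptions().phrase("young", "teenage"));
    }
}
```

In the title field, "creed" (position 2) comes before "black" (position 3), so the phrase "black creed" finds nothing, while "young teenage" matches document 4 at positions 9 and 10.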
  14. 14. Autocomplete
      Field: Description. Typing "c" selects the terms that start with that prefix: captain, caribbean, character, criminal, customised
  15. 15. Autocomplete: Finite State Transducer
      Terms encoded: captain, caribbean, character, criminal, young, your
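A prefix lookup over a sorted term dictionary conveys the idea behind autocomplete. Note the substitution: this sketch uses a `java.util.TreeSet` instead of a finite state transducer, so it shows the behaviour, not the compact data structure Lucene uses. The class name `AutocompleteSketch` and the sample term list are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical helper class, not part of Lucene; a sorted set stands in for the FST.
final class AutocompleteSketch {

    // A handful of description terms taken from the slides.
    static NavigableSet<String> sampleTerms() {
        return new TreeSet<>(Arrays.asList(
                "brave", "brutal", "captain", "caribbean", "character", "creating",
                "criminal", "customised", "customising", "developed", "young"));
    }

    // Returns up to 'limit' terms that start with 'prefix', in alphabetical order.
    static List<String> suggest(NavigableSet<String> terms, String prefix, int limit) {
        List<String> suggestions = new ArrayList<>();
        // tailSet starts at the first term >= prefix; stop as soon as the prefix stops matching.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix) || suggestions.size() == limit) {
                break;
            }
            suggestions.add(term);
        }
        return suggestions;
    }

    public static void main(String[] args) {
        System.out.println(suggest(sampleTerms(), "c", 5));
        System.out.println(suggest(sampleTerms(), "cu", 5));
    }
}
```

Because the terms are sorted, all candidates for a prefix sit next to each other, so the lookup never scans unrelated terms.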
  16. 16. Relevance
      q=description:outlaws
      Id: 2 | Console: PS4 | Title: DC Universe Online | Description: Battle with or against their favourite heroes and outlaws, or your own customised character
      Id: 3 | Console: PS3 | Title: Assassins Creed | Description: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway
      ? DC Universe Online
      ? Assassins Creed
  17. 17. Vector
      Two dimensions (d1, d2): V = (2, 3), that is V = 2 . d1 + 3 . d2
      Three dimensions (d1, d2, d3): V = (2, 3, 4), that is V = 2 . d1 + 3 . d2 + 4 . d3
  18. 18. Score - Vector Model
 • Result documents are represented as vectors
 • The query is represented as a vector
 • Vector dimensions are terms
 • Vector ‘quantities’ are Tf-Idf
 • Score = Cosine similarity between the query vector and the document vector
      0.4024 DC Universe Online
      0.3219 Assassins Creed
  19. 19. Documents and Queries as vectors
      Id: 2 | Console: PS4 | Title: DC Universe Online | Description: Battle with or against their favourite heroes and outlaws, or your own customised character
      Id: 3 | Console: PS3 | Title: Assassins Creed | Description: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway
      D2 = w21 . against + w22 . battle + … + w23 . outlaws + w2j . own
      D3 = w31 . among + w32 . captain + … + w35 . outlaws + … + w3j . young
      Q = wq . outlaws
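Treating documents and queries as sparse term vectors, cosine similarity can be sketched as below. This shows the textbook formula only; Lucene's practical scoring function adds normalisation and boost factors that are not modelled here. The class name `CosineSketch` and the choice to return 0 for an empty vector are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of Lucene.
final class CosineSketch {

    // Euclidean length of a sparse vector whose dimensions are terms.
    static double norm(Map<String, Double> vector) {
        double sumOfSquares = 0.0;
        for (double weight : vector.values()) {
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }

    // cos(a, b) = (a . b) / (|a| * |b|); terms missing from a vector have weight 0.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            dot += entry.getValue() * b.getOrDefault(entry.getKey(), 0.0);
        }
        double lengths = norm(a) * norm(b);
        // Assumption: an empty vector is defined to have similarity 0 with everything.
        return lengths == 0.0 ? 0.0 : dot / lengths;
    }

    public static void main(String[] args) {
        Map<String, Double> query = new HashMap<>();
        query.put("outlaws", 1.0);

        // A truncated document vector, using only the weights shown for D3 in the slides.
        Map<String, Double> d3 = new HashMap<>();
        d3.put("among", 1.6931);
        d3.put("captain", 1.6931);
        d3.put("outlaws", 1.287);
        d3.put("young", 1.287);

        System.out.println("cosine(Q, D3) = " + cosine(query, d3));
    }
}
```

Only the dimensions shared by both vectors contribute to the dot product, so a single-term query is compared against just one weight of each document, scaled by the document's overall length.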
  20. 20. Term Weights
 • Term frequency (Tf): number of appearances of the term in the doc
 • Inverse Document Frequency (Idf): Idf = 1 + ln(nDocs / (docFreq + 1))
      nDocs = 4
      TERM      sqrt(Tf)   docFreq   Idf      w
      among     1          1         1.6931   1.6931
      across    0          1         1.6931   0
      outlaws   1          2         1.287    1.287
      young     1          2         1.287    1.287
      D3 = 1.6931 . among + 1.6931 . captain + … + 1.287 . outlaws + … + 1.287 . young
      Id: 3 | Console: PS3 | Title: Assassins Creed | Description: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway
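The numbers in the table can be reproduced with a few lines of Java. The sketch assumes the weight is sqrt(Tf) times Idf with Idf = 1 + ln(nDocs / (docFreq + 1)), which is consistent with the values on this slide; the class and method names are hypothetical.

```java
// Hypothetical helper class, not part of Lucene.
final class TermWeightSketch {

    // Idf = 1 + ln(nDocs / (docFreq + 1)); rarer terms get a larger value.
    static double idf(int docFreq, int nDocs) {
        return 1.0 + Math.log((double) nDocs / (docFreq + 1));
    }

    // w = sqrt(Tf) * Idf; a term that is absent from the document (Tf = 0) weighs 0.
    static double weight(int termFreq, int docFreq, int nDocs) {
        return Math.sqrt(termFreq) * idf(docFreq, nDocs);
    }

    public static void main(String[] args) {
        int nDocs = 4;
        // Term frequencies in document 3 and document frequencies, as on the slide.
        System.out.printf("among   %.4f%n", weight(1, 1, nDocs));
        System.out.printf("across  %.4f%n", weight(0, 1, nDocs));
        System.out.printf("outlaws %.4f%n", weight(1, 2, nDocs));
        System.out.printf("young   %.4f%n", weight(1, 2, nDocs));
    }
}
```

With only four documents the spread is small, but on a large index the logarithm keeps very common terms from dominating while still rewarding rare ones.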
  21. 21. Tf-Idf
 • The more often a term appears in a document, the higher its weight
 • The rarer a term is index-wide, the higher its weight
  22. 22. Lucene API
  23. 23. Lucene API - Documents
      import org.apache.lucene.document.Document;
      import org.apache.lucene.document.Field.Store;
      import org.apache.lucene.document.IntField;
      import org.apache.lucene.document.TextField;

      // One Document per game; Store.YES keeps the original value retrievable from search results.
      Document doc = new Document();

      doc.add(new IntField("id", 1, Store.YES));
      doc.add(new TextField("console", "PS3", Store.YES));
      doc.add(new TextField("title", "GTA V", Store.YES));
      doc.add(new TextField("description", "You start by creating and developing a unique character and invest in your potential criminal by customising his or her appearance", Store.YES));
  24. 24. Lucene API - Analysis
      Name             Type     Analysis
      Id               Number   None
      Console          String   Lowercase
      Title            Text     Whitespace, Lowercase
      Description      Text     Whitespace, Lowercase, Remove common words
      Description_jp   Text     Japanese Tokenizer
      Example document:
      Id: 1 | Console: PS3 | Title: GTAV
      Description: You start by creating and developing a unique character and invest in your potential criminal by customising his or her appearance
      Description_jp: かつてないほど大規模でダイナミックな多様性に富んだオープンワールドを誇る『グランド・セフト・オートV』は、ストーリーテリングとゲームプレイを新しい手法で融合。
  25. 25. Lucene API - Analysis
      Analyzer: "Pirates rule the Caribbean" -> Whitespace Tokenizer -> Lowercase TokenFilter -> Stopwords TokenFilter -> pirates, rule, caribbean
  26. 26. Lucene API - Analysis: Custom Analyzer
      public class MySimpleAnalyzer extends Analyzer {
          @Override
          protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
              // Split the input on whitespace, then lowercase every token.
              WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(reader);
              LowerCaseFilter lcFilter = new LowerCaseFilter(tokenizer);
              return new TokenStreamComponents(tokenizer, lcFilter);
          }
      }
  27. 27. Lucene API - Analysis
      From org.apache.lucene.analysis.ja.JapaneseAnalyzer:
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
          Tokenizer tokenizer = new JapaneseTokenizer(reader, userDict, true, mode);
          TokenStream stream = new JapaneseBaseFormFilter(tokenizer);
          stream = new JapanesePartOfSpeechStopFilter(stream, stoptags);
          stream = new CJKWidthFilter(stream);
          stream = new StopFilter(stream, stopwords);
          stream = new JapaneseKatakanaStemFilter(stream);
          stream = new LowerCaseFilter(stream);
          return new TokenStreamComponents(tokenizer, stream);
      }
  28. 28. Lucene API - Analysis
      From org.apache.lucene.analysis.standard.StandardAnalyzer:
      @Override
      protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
          final StandardTokenizer src = new StandardTokenizer(getVersion(), reader);
          …
          TokenStream tok = new StandardFilter(getVersion(), src);
          tok = new LowerCaseFilter(getVersion(), tok);
          tok = new StopFilter(getVersion(), tok, stopwords);
          return new TokenStreamComponents(src, tok);
      }
  29. 29. Lucene API - Indexing
      // Choose an analyzer per field; fields not listed here fall back to StandardAnalyzer.
      Map<String, Analyzer> analyzerMap = new HashMap<String, Analyzer>();
      analyzerMap.put("id", new KeywordAnalyzer());
      analyzerMap.put("console", new MySimpleAnalyzer());
      analyzerMap.put("description", new StandardAnalyzer());
      analyzerMap.put("description_jp", new JapaneseAnalyzer());

      PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), analyzerMap);

      Directory ramDirectory = new RAMDirectory();
      IndexWriterConfig iwc = new IndexWriterConfig(Version.LATEST, analyzer);
      IndexWriter iw = new IndexWriter(ramDirectory, iwc);
      for (Document document : documents) {
          iw.addDocument(document);
      }
      iw.close();
  30. 30. Lucene API - Directory
 • RAMDirectory (for tests only)
 • FSDirectory
    • MMapDirectory (default for 64-bit)
    • SimpleFSDirectory (java.io.RandomAccessFile)
    • NIOFSDirectory (java.io.FileChannel)
    • WindowsDirectory (native, requires a .dll)
    • NativeUnixDirectory (experimental)
 • InfinispanDirectory (3rd party)
  31. 31. Lucene API - Directory
      IndexWriter.close() -> _0.fdt _0.fdx _0.fnm _0.nvd _0.nvm _0.si _0_Lucene41_0.doc _0_Lucene41_0.pos _0_Lucene41_0.tim _0_Lucene41_0.tip
      IndexWriter.close() -> _1.fdt _1.fdx _1.fnm _1.nvd _1.nvm _1.si _1_Lucene41_0.doc _1_Lucene41_0.pos _1_Lucene41_0.tim _1_Lucene41_0.tip
      IndexWriter.close() -> _2.fdt _2.fdx _2.fnm _2.nvd _2.nvm _2.si _2_Lucene41_0.doc _2_Lucene41_0.pos _2_Lucene41_0.tim _2_Lucene41_0.tip
  32. 32. Lucene API - Directory from http://blog.mikemccandless.com/
  33. 33. Lucene API - Autocomplete
      DirectoryReader reader = DirectoryReader.open(directory);

      // Build the suggester from the terms of the "description" field.
      AnalyzingSuggester suggester = new AnalyzingSuggester(new StandardAnalyzer());
      LuceneDictionary dictionary = new LuceneDictionary(reader, "description");
      suggester.build(dictionary);

      List<Lookup.LookupResult> suggestions = suggester.lookup("c", false, 5);

      for (Lookup.LookupResult suggestion : suggestions) {
          System.out.println(suggestion.key);
      }
      Output:
      captain
      caribbean
      character
      creating
      criminal
  34. 34. Lucene API - Search
      q = description:character
      DirectoryReader reader = DirectoryReader.open(directory);

      TermQuery termQuery = new TermQuery(new Term("description", "character"));
      IndexSearcher indexSearcher = new IndexSearcher(reader);
      TopDocs topDocs = indexSearcher.search(termQuery, 10);

      // Print the score and the stored title of each hit.
      for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
          int internalId = scoreDoc.doc;
          Document document = reader.document(internalId);
          String title = document.get("title");
          System.out.printf("%f - %s\n", scoreDoc.score, title);
      }
      Output:
      0.402401 - DC Universe Online
      0.321921 - GTAV
  35. 35. Lucene API - Search
      q=description:"young teenage"
      DirectoryReader reader = DirectoryReader.open(directory);

      // The terms must appear next to each other, in this order.
      PhraseQuery query = new PhraseQuery();
      query.add(new Term("description", "young"));
      query.add(new Term("description", "teenage"));

      IndexSearcher indexSearcher = new IndexSearcher(reader);
      TopDocs topDocs = indexSearcher.search(query, 10);

      for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
          int internalId = scoreDoc.doc;
          Document document = reader.document(internalId);
          String title = document.get("title");
          System.out.printf("%f - %s\n", scoreDoc.score, title);
      }
      Output:
      0.745207 - The Last of Us
  36. 36. Lucene API - Search
      q = console:"PS3" AND (description:"pirate" OR description:"criminal")
      DirectoryReader reader = DirectoryReader.open(directory);

      TermQuery descriptionOne = new TermQuery(new Term("description", "pirate"));
      TermQuery descriptionTwo = new TermQuery(new Term("description", "criminal"));

      // OR: at least one of the SHOULD clauses has to match.
      BooleanQuery descriptionQuery = new BooleanQuery();
      descriptionQuery.add(descriptionOne, BooleanClause.Occur.SHOULD);
      descriptionQuery.add(descriptionTwo, BooleanClause.Occur.SHOULD);

      TermQuery consoleQuery = new TermQuery(new Term("console", "ps3"));

      // AND: both MUST clauses have to match.
      BooleanQuery query = new BooleanQuery();
      query.add(consoleQuery, BooleanClause.Occur.MUST);
      query.add(descriptionQuery, BooleanClause.Occur.MUST);

      IndexSearcher indexSearcher = new IndexSearcher(reader);
      TopDocs topDocs = indexSearcher.search(query, 10);
      Output:
      0.741689 - GTAV
      0.741689 - Assassins Creed Black Flag
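On the inverted index, the matching part of this boolean query comes down to set operations on postings lists. The sketch below covers matching only and ignores scoring; the class `BooleanSketch` and its methods are hypothetical, and the postings are the ones from the sample documents.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper class, not part of Lucene.
final class BooleanSketch {

    static SortedSet<Integer> postings(Integer... docIds) {
        return new TreeSet<>(Arrays.asList(docIds));
    }

    // AND (two MUST clauses): a document has to appear in both postings lists.
    static SortedSet<Integer> and(SortedSet<Integer> a, SortedSet<Integer> b) {
        SortedSet<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // OR (two SHOULD clauses): a document has to appear in at least one postings list.
    static SortedSet<Integer> or(SortedSet<Integer> a, SortedSet<Integer> b) {
        SortedSet<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    public static void main(String[] args) {
        SortedSet<Integer> ps3 = postings(1, 3);       // console:ps3
        SortedSet<Integer> pirate = postings(3);       // description:pirate
        SortedSet<Integer> criminal = postings(1);     // description:criminal

        // console:ps3 AND (description:pirate OR description:criminal)
        System.out.println(and(ps3, or(pirate, criminal)));
    }
}
```

Documents 1 (GTAV) and 3 (Assassins Creed Black Flag) survive the intersection, which agrees with the two hits listed on the slide.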
  37. 37. Lucene API - Search: Query Parser
      // "description" is the default field for terms without a field prefix.
      QueryParser queryParser = new QueryParser("description", analyzer);
      Query query = queryParser.parse("console:PS3 AND (description:pirate OR description:criminal)");

      IndexSearcher indexSearcher = new IndexSearcher(reader);
      TopDocs topDocs = indexSearcher.search(query, 10);

      for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
          int internalId = scoreDoc.doc;
          Document document = reader.document(internalId);
          String title = document.get("title");
          System.out.printf("%f - %s\n", scoreDoc.score, title);
      }
      Output:
      0.741689 - GTAV
      0.741689 - Assassins Creed Black Flag
  38. 38. Lucene API - Sort
      QueryParser queryParser = new QueryParser("description", analyzer);
      Query query = queryParser.parse("console:PS3 AND (description:pirate OR description:criminal)");

      IndexSearcher indexSearcher = new IndexSearcher(reader);

      // Sorting by a field skips score computation by default, hence the NaN scores below.
      Sort sort = new Sort(new SortField("title", SortField.Type.STRING, true));
      TopDocs topDocs = indexSearcher.search(query, 10, sort);

      for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
          int internalId = scoreDoc.doc;
          Document document = reader.document(internalId);
          String title = document.get("title");
          System.out.printf("%f - %s\n", scoreDoc.score, title);
      }
      Output:
      NaN - Assassins Creed Black Flag
      NaN - GTAV
  39. 39. Lucene API - Explain
      DirectoryReader reader = DirectoryReader.open(directory);

      TermQuery termQuery = new TermQuery(new Term("description", "character"));
      IndexSearcher indexSearcher = new IndexSearcher(reader);
      TopDocs topDocs = indexSearcher.search(termQuery, 10);

      // Ask the searcher how the score of each hit was computed.
      for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
          int internalId = scoreDoc.doc;
          Explanation explanation = indexSearcher.explain(termQuery, internalId);
          System.out.println(explanation);
      }
      Output:
      0.40240064 = (MATCH) weight(description:character in 2) [DefaultSimilarity], result of:
        0.40240064 = fieldWeight in 2, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          1.287682 = idf(docFreq=2, maxDocs=4)
          0.3125 = fieldNorm(doc=2)

      0.3219205 = (MATCH) weight(description:character in 0) [DefaultSimilarity], result of:
        0.3219205 = fieldWeight in 0, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          1.287682 = idf(docFreq=2, maxDocs=4)
          0.25 = fieldNorm(doc=0)
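The product reported in the explain output can be checked by hand. The sketch below multiplies the three factors listed there (tf, idf, fieldNorm); the fieldNorm values are copied from the output rather than derived, and the class name `ExplainSketch` is hypothetical.

```java
// Hypothetical helper class, not part of Lucene.
final class ExplainSketch {

    // fieldWeight = tf * idf * fieldNorm, with tf = sqrt(termFreq)
    // and idf = 1 + ln(maxDocs / (docFreq + 1)).
    static double fieldWeight(double termFreq, int docFreq, int maxDocs, double fieldNorm) {
        double tf = Math.sqrt(termFreq);
        double idf = 1.0 + Math.log((double) maxDocs / (docFreq + 1));
        return tf * idf * fieldNorm;
    }

    public static void main(String[] args) {
        // description:character occurs once in each of the two matching documents.
        System.out.printf("doc 2: %.6f%n", fieldWeight(1.0, 2, 4, 0.3125));
        System.out.printf("doc 0: %.6f%n", fieldWeight(1.0, 2, 4, 0.25));
    }
}
```

Both documents share the same tf and idf, so the difference in score comes entirely from fieldNorm, which favours the shorter description.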
  40. 40. Reviews provided by ign.com
