Full Text Search with Lucene

Published in: Technology, Business

0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
5,226
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
126
Comments
0
Likes
2
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Full Text Search
      David LeBer, Align Software Inc.
  • 2. What is full text search?
  • 3. How?
      • Wild card database queries
      • Database implementations
      • Third party search engines
      • Text indexing libraries
  • 4. Wild Card Queries
      SELECT * FROM SOME_TABLE WHERE SOME_COLUMN LIKE '%Some String%'
  • 5. Wild Card Queries
      • Easy
  • 6. Wild Card Queries
      • Slow
      • Hard to optimize
      • Difficult to rank
  • 7. Database Implementations
      • MySQL FULLTEXT index and MATCH queries
      • PostgreSQL tsvector & tsquery
  • 8. Database Implementations
      • Fairly Easy
  • 9. Database Implementations
      • Database specific SQL
      • May include additional limitations (e.g. MySQL: MyISAM tables only)
      • Functionality defined by the DB engine
  • 10. Third Party Search Engines
      • Google indexing / searching of your content
  • 11. Third Party Search Engines
      • Easy
      • Matches user expectations
  • 12. Third Party Search Engines
      • Content must be available for indexing
      • Loss of control
      • Enhances the Google hegemony
  • 13. Text Indexing Library
      • Lucene
  • 14. Text Indexing Library
      • Complete control
      • Database independent
      • Flexible search behaviour
      • Ranked results
  • 15. Text Indexing Library
      • Adds complexity
      • Additional query language
      • Parallel index
  • 16. Lucene Overview
      • Open Source - part of the Apache Project
      • Very flexible
      • Wickedly fast
      • Index based
  • 17. Lucene : Installing
      • Add the Lucene jars to your classpath
      • Use ERIndexing
  • 18. Lucene : Tasks
      • Indexing
      • Searching
  • 19. Indexing
  • 20. What is Indexing?
  • 21. Indexing : Steps
      • Conversion (to plain text)
      • Analysis (clean and convert the text to tokens)
      • Index (save the tokens to the index)
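
The analysis and index steps above can be sketched with nothing but the Java standard library. This is a toy inverted index for illustration, not Lucene's implementation; the class name `ToyIndex` and its methods are hypothetical, and the conversion step is assumed to have already produced plain text.

```java
import java.util.*;

// Illustrative only: a toy inverted index, not how Lucene stores its data.
final class ToyIndex {
    // term -> ids of the documents that contain it
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Analysis: lower-case the text and split on anything that is not a letter or digit.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Index: record the document id under every token of the plain text.
    void add(int docId, String plainText) {
        for (String token : analyze(plainText)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Lookup is one map access rather than a scan over every stored text.
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(1, "Full text search with Lucene");
        index.add(2, "Lucene is index based");
        System.out.println(index.lookup("lucene")); // [1, 2]
        System.out.println(index.lookup("index"));  // [2]
    }
}
```

The contrast with a wild card query is that the cost of `lookup` does not grow with the amount of text indexed, at the price of maintaining a parallel structure.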
  • 22. Indexing : Parts
      • Index - either file or memory based
      • Document - represents a unique object added to the index
      • Field - identifies a chunk of data in the document
  • 23. Indexing : Classes
      • IndexWriter
      • Directory
      • Analyzer
      • Document
      • Field
  • 24. Creating an Index
      URL indexDirectoryURL = ... // assume exists
      File indexFile = new File(indexDirectoryURL.getPath());
      FSDirectory indexDirectory = FSDirectory.open(indexFile);
      StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
      // Pass the Directory opened above; true creates a new index, replacing any existing one
      IndexWriter indexWriter = new IndexWriter(indexDirectory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
  • 25. Indexing : Field Parameters
      • Stored or not
      • Analyzed or not, with and without norms
      • Include position, offset, and term frequency
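
To make the first two field parameters concrete, here is a minimal stdlib-only sketch of what "stored" and "analyzed" control. It is not Lucene code: `ToyFieldStore` and its methods are hypothetical, the whitespace tokenizer stands in for a real analyzer, and norms, positions, and offsets are left out.

```java
import java.util.*;

// Illustrative only: "stored" controls retrieval, "analyzed" controls how the value is tokenized.
final class ToyFieldStore {
    private final Map<String, Set<Integer>> terms = new HashMap<>();          // searchable side
    private final Map<Integer, Map<String, String>> stored = new HashMap<>(); // retrievable side

    void addField(int docId, String name, String value, boolean store, boolean analyze) {
        if (store) {
            stored.computeIfAbsent(docId, k -> new HashMap<>()).put(name, value);
        }
        // Analyzed: lower-cased whitespace tokens. Not analyzed: the whole value is one exact term.
        String[] tokens = analyze
                ? value.toLowerCase(Locale.ROOT).split("\\s+")
                : new String[] { value };
        for (String t : tokens) {
            if (!t.isEmpty()) {
                terms.computeIfAbsent(name + ":" + t, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    Set<Integer> search(String field, String term) {
        return terms.getOrDefault(field + ":" + term, Collections.emptySet());
    }

    // Returns null when the field was indexed but not stored.
    String storedValue(int docId, String field) {
        return stored.getOrDefault(docId, Collections.emptyMap()).get(field);
    }
}
```

In this sketch a field added with `store = false` can still be found by `search`, but its original text cannot be read back, which is the trade-off the "stored or not" parameter describes.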
  • 26. Indexing : Analyzers
      • SimpleAnalyzer
      • StopAnalyzer
      • StandardAnalyzer
      • ...
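
The difference between the first two analyzers can be approximated in plain Java: SimpleAnalyzer splits text at non-letters and lower-cases it, and StopAnalyzer does the same and then drops common stop words. The sketch below imitates that behaviour without using Lucene; `ToyAnalyzers` is a hypothetical name and the stop list is a short assumed subset, not Lucene's actual list.

```java
import java.util.*;

// Illustrative only: approximations of two analyzer behaviours, not Lucene classes.
final class ToyAnalyzers {
    // Assumed, abbreviated stop list for the example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to", "with"));

    // Split at anything that is not a letter, then lower-case.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Same as simple(), with stop words removed afterwards.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(simple("The Power of Lucene 2.9")); // [the, power, of, lucene]
        System.out.println(stop("The Power of Lucene 2.9"));   // [power, lucene]
    }
}
```

Whichever analyzer builds the index should normally also process the query text, otherwise the tokens on the two sides will not line up.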
  • 27. Adding a Document
      String value = ... // assume exists
      Document doc = new Document();
      Field docField = new Field("title", value, Field.Store.YES, Field.Index.ANALYZED);
      doc.add(docField);
      ...
      indexWriter.addDocument(doc);
  • 28. Indexing : Fun with indexes
      • Multiple Access
  • 29. Searching
  • 30. What is Searching?
  • 31. Searching : Steps
      • Clean the user input
      • Create a Query
      • Query the Index
      • Return the results
  • 32. Searching : Search Classes
      • IndexReader
      • IndexSearcher
      • Query
      • QueryParser
      • TopDocs/ScoreDoc
      • Document
  • 33. Searching : Query Types
      • TermQuery
      • RangeQuery
      • PrefixQuery
      • BooleanQuery
      • PhraseQuery
      • WildcardQuery
      • FuzzyQuery
  • 34. Searching : QueryParser
      • webobjects - contains an exact match - TermQuery
      • webobjects apple, webobjects OR apple - an OR Query
      • +webobjects +apple / webobjects AND apple - an AND Query
      • title:webobjects - Contains the term in title field
      • title:webobjects -subject:iTunes / title:webobjects AND NOT subject:iTunes
      • (webobjects OR apple) AND iTunes
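
One way to picture the AND, OR, and NOT operators above is as set operations on the document ids matched by each term. The sketch below shows only that idea; it is not how Lucene's BooleanQuery is implemented (which also scores the matches), and the class name and the document ids are made up for the example.

```java
import java.util.*;

// Illustrative only: boolean operators as set operations on matching document ids.
final class ToyBoolean {
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.retainAll(b); // keep ids present in both
        return r;
    }

    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.addAll(b); // keep ids present in either
        return r;
    }

    static Set<Integer> andNot(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.removeAll(b); // drop ids that match the excluded term
        return r;
    }

    public static void main(String[] args) {
        // Hypothetical ids of documents containing each term.
        Set<Integer> webobjects = new TreeSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> apple = new TreeSet<>(Arrays.asList(2, 3, 4));
        Set<Integer> itunes = new TreeSet<>(Arrays.asList(3, 5));

        // (webobjects OR apple) AND iTunes
        System.out.println(and(or(webobjects, apple), itunes)); // [3]
        // webobjects AND NOT iTunes
        System.out.println(andNot(webobjects, itunes));         // [1, 2]
    }
}
```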
  • 35. Searching : QueryParser
      • title:"apple webobjects" - Phrase Query
      • title:"apple webobjects"~5 - slop of 5
      • webobj* - Prefix Query
      • webobjicts~ - Fuzzy Query
      • lastmodified:[1/1/10 TO 1/1/11] - Range Query
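
The prefix and fuzzy examples above come down to two string tests: "starts with" and "is within a small edit distance of". The following sketch shows both with a standard Levenshtein distance; it illustrates the idea behind a fuzzy match and makes no claim about how Lucene turns the distance into a similarity threshold. `ToyFuzzy` is a hypothetical name.

```java
// Illustrative only: the string comparisons behind prefix and fuzzy matching.
final class ToyFuzzy {
    // Levenshtein distance: minimum number of single-character inserts, deletes, or substitutions.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // A prefix query such as webobj* matches every indexed term that starts with the prefix.
    static boolean prefixMatch(String indexedTerm, String prefix) {
        return indexedTerm.startsWith(prefix);
    }

    public static void main(String[] args) {
        System.out.println(editDistance("webobjicts", "webobjects")); // 1
        System.out.println(prefixMatch("webobjects", "webobj"));      // true
    }
}
```

The misspelled `webobjicts` is one substitution away from `webobjects`, which is why a fuzzy query can still find the intended term.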
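  • 36. Performing a Search
      Query query = ... // assume exists
      // indexDirectory is the Directory the index was written to; true opens it read-only
      IndexSearcher searcher = new IndexSearcher(indexDirectory, true);
      // Collect the 10 best-scoring hits
      TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
      searcher.search(query, collector);
      ScoreDoc[] hits = collector.topDocs().scoreDocs;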
  • 37. Using a QueryParser
      // "content" is the default field searched when the query string names none
      QueryParser queryParser = new QueryParser(Version.LUCENE_29, "content", analyzer);
      Query query = queryParser.parse(queryString);
  • 38. Demo
  • 39. Scoring
  • 40. “The more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query”
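
The quoted rule is the intuition behind TF-IDF weighting. A minimal sketch of the textbook form, term frequency multiplied by the log of the inverse document frequency, is given below. It is a simplification for illustration: Lucene's real scoring includes further factors, and `ToyTfIdf` and the sample corpus are made up.

```java
import java.util.*;

// Illustrative only: textbook TF-IDF, a simplification of what a search library computes.
final class ToyTfIdf {
    static double score(String term, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(term::equals).count();             // occurrences in this document
        long df = corpus.stream().filter(d -> d.contains(term)).count(); // documents containing the term
        if (tf == 0 || df == 0) return 0.0;
        return tf * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("lucene", "lucene", "search");
        List<String> d2 = Arrays.asList("search", "engine");
        List<String> d3 = Arrays.asList("database", "search");
        List<List<String>> corpus = Arrays.asList(d1, d2, d3);

        // "lucene" is frequent in d1 and rare elsewhere, so it scores high for d1.
        System.out.println(score("lucene", d1, corpus) > score("search", d1, corpus)); // true
        // "search" appears in every document, so it carries no weight here.
        System.out.println(score("search", d1, corpus)); // 0.0
    }
}
```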
  • 41. Boost
      • While Indexing
          • Document
          • Field
      • While Searching
          • Query
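
As a rough mental model, a boost is a multiplier applied to part of a score so that, for example, a hit in the title counts for more than a hit in the body. The sketch below shows only that multiplication, assuming per-field raw scores are already available; `ToyBoost`, the field names, and the numbers are hypothetical, and this is not Lucene's scoring code.

```java
import java.util.*;

// Illustrative only: a boost as a per-field multiplier on an already computed raw score.
final class ToyBoost {
    static double boosted(Map<String, Double> rawFieldScores, Map<String, Double> fieldBoosts) {
        double total = 0.0;
        for (Map.Entry<String, Double> e : rawFieldScores.entrySet()) {
            // Fields without an explicit boost keep a neutral factor of 1.0.
            total += e.getValue() * fieldBoosts.getOrDefault(e.getKey(), 1.0);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> raw = new HashMap<>();
        raw.put("title", 1.0);
        raw.put("body", 1.0);

        Map<String, Double> boosts = new HashMap<>();
        boosts.put("title", 2.0); // a title match counts double

        System.out.println(boosted(raw, boosts)); // 3.0
    }
}
```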
  • 42. Luke
  • 43. Demo
  • 44. ERIndexing
  • 45. ERIndexing : Strengths
      • Hides some of the complexity of integrating Lucene with WO
      • Offers lots of utility and helper methods
      • Speaks WebObjects collection classes
      • Simplifies index creation
  • 46. ERIndexing : Weaknesses
      • Hides some of the complexity of integrating Lucene with WO
      • Not fully baked
      • Auto indexing may be dangerous
  • 47. Demo
  • 48. Beyond Lucene
      • Solr
      • Compass
      • ElasticSearch
  • 49. Q&A
      Lucene: http://lucene.apache.org
      Luke: http://code.google.com/p/luke/
      Solr: http://lucene.apache.org/solr/
      Compass: http://www.compass-project.org/overview.html
      ElasticSearch: http://www.elasticsearch.com/