Full Text Search with Lucene

  1. Full Text Search
     David LeBer, Align Software Inc.
  2. What is full text search?
  3. How?
     • Wild card database queries
     • Database implementations
     • Third party search engines
     • Text indexing libraries
  4. Wild Card Queries
     SELECT * FROM SOME_TABLE WHERE SOME_COLUMN LIKE '%Some String%'
  5. Wild Card Queries
     • Easy
  6. Wild Card Queries
     • Slow
     • Hard to optimize
     • Difficult to rank
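
To illustrate why a wild card query is slow and hard to rank, here is a minimal sketch of what a LIKE '%...%' match amounts to when no index can help: every row is visited and tested for a substring. The class and method names are hypothetical and the rows are an in-memory list standing in for a table; real database engines differ in detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WildcardScan {
    // Returns the positions of all rows that contain the needle, ignoring case.
    // Every row is inspected, so the cost grows with the size of the table,
    // and matches come back in table order with no notion of relevance.
    static List<Integer> like(List<String> rows, String needle) {
        String wanted = needle.toLowerCase(Locale.ROOT);
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            if (rows.get(i).toLowerCase(Locale.ROOT).contains(wanted)) {
                matches.add(i);
            }
        }
        return matches;
    }
}
```

The result is only a yes/no answer per row, which is why ranking has to be bolted on separately.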
  7. Database Implementations
     • MySQL FULLTEXT index and MATCH queries
     • PostgreSQL tsvector & tsquery
  8. Database Implementations
     • Fairly Easy
  9. Database Implementations
     • Database specific SQL
     • May include additional limitations (e.g. MySQL: MyISAM tables only)
     • Functionality defined by the DB engine
  10. Third Party Search Engines
      • Google indexing / searching of your content
  11. Third Party Search Engines
      • Easy
      • Matches user expectations
  12. Third Party Search Engines
      • Content must be available for indexing
      • Loss of control
      • Enhances the Google hegemony
  13. Text Indexing Library
      • Lucene
  14. Text Indexing Library
      • Complete control
      • Database independent
      • Flexible search behaviour
      • Ranked results
  15. Text Indexing Library
      • Adds complexity
      • Additional query language
      • Parallel index
  16. Lucene Overview
      • Open Source - part of the Apache Project
      • Very flexible
      • Wickedly fast
      • Index based
  17. Lucene : Installing
      • Add the Lucene jars to your classpath
      • Use ERIndexing
  18. Lucene : Tasks
      • Indexing
      • Searching
  19. Indexing
  20. What is Indexing?
  21. Indexing : Steps
      • Conversion (to plain text)
      • Analysis (clean and convert the text to tokens)
      • Index (save the tokens to the index)
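
The analysis and index steps can be sketched with a toy inverted index in plain Java. This is an illustration of the idea only, not Lucene's implementation: the class `TinyIndex`, its methods, and the integer document ids are all hypothetical, and the conversion step is assumed to have already produced plain text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class TinyIndex {
    // token -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Analysis: lower-case the text and split on anything that is not a letter or digit.
    static List<String> analyze(String plainText) {
        List<String> tokens = new ArrayList<>();
        for (String t : plainText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Index: record every token of the text against the document id.
    void add(int docId, String plainText) {
        for (String token : analyze(plainText)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Lookup: a single map access instead of a scan over every document.
    Set<Integer> lookup(String token) {
        return postings.getOrDefault(token.toLowerCase(Locale.ROOT), new TreeSet<>());
    }
}
```

Because the work of tokenizing is done once at indexing time, a search only has to look up tokens, which is what makes an index-based approach fast.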
  22. Indexing : Parts
      • Index - either file or memory based
      • Document - represents a unique object added to the index
      • Field - identifies a chunk of data in the document
  23. Indexing : Classes
      • IndexWriter
      • Directory
      • Analyzer
      • Document
      • Field
  24. Creating an Index
      URL indexDirectoryURL = ... // assume exists
      File indexFile = new File(indexDirectoryURL.getPath());
      // Open a file-based index directory
      FSDirectory indexDirectory = FSDirectory.open(indexFile);
      StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
      // true = create a new index, replacing any existing one
      IndexWriter indexWriter = new IndexWriter(indexDirectory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
  25. Indexing : Field Parameters
      • Stored or not
      • Analyzed or not, with and without norms
      • Include position, offset, and term frequency
  26. Indexing : Analyzers
      • SimpleAnalyzer
      • StopAnalyzer
      • StandardAnalyzer
      • ...
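
As a rough illustration of how the first two analyzers differ, the sketch below imitates their behaviour in plain Java: SimpleAnalyzer splits at non-letters and lower-cases, and StopAnalyzer additionally drops common stop words. The class `AnalyzerSketch` and its small stop word list are assumptions made for this example; they are not Lucene's classes or its actual stop word set, and StandardAnalyzer's more sophisticated tokenization is not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AnalyzerSketch {
    // Illustrative subset only; not Lucene's stop word list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "is", "to", "in", "with"));

    // Like SimpleAnalyzer: divide the text at non-letters and lower-case each token.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Like StopAnalyzer: the simple tokens with stop words removed.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }
}
```

For the input "The Power of Lucene 2.9", `simple` yields [the, power, of, lucene] and `stop` yields [power, lucene]. The same analyzer should normally be used for indexing and for parsing queries, otherwise the tokens will not line up.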
  27. Adding a Document
      String value = ... // assume exists
      Document doc = new Document();
      // Store the original value and analyze it so it can be searched
      Field docField = new Field("title", value, Field.Store.YES, Field.Index.ANALYZED);
      doc.add(docField);
      ...
      indexWriter.addDocument(doc);
  28. Indexing : Fun with indexes
      • Multiple Access
  29. Searching
  30. What is Searching?
  31. Searching : Steps
      • Clean the user input
      • Create a Query
      • Query the Index
      • Return the results
  32. Searching : Search Classes
      • IndexReader
      • IndexSearcher
      • Query
      • QueryParser
      • TopDocs/ScoreDocs
      • Document
  33. Searching : Query Types
      • TermQuery
      • RangeQuery
      • PrefixQuery
      • BooleanQuery
      • PhraseQuery
      • WildcardQuery
      • FuzzyQuery
  34. Searching : QueryParser
      • webobjects - contains an exact match - TermQuery
      • webobjects apple, webobjects OR apple - an OR Query
      • +webobjects +apple / webobjects AND apple - an AND Query
      • title:webobjects - Contains the term in title field
      • title:webobjects -subject:iTunes / title:webobjects AND NOT subject:iTunes
      • (webobjects OR apple) AND iTunes
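
Conceptually, the boolean operators above combine sets of matching documents. The sketch below shows that idea with plain Java sets, assuming each term has already been resolved to the ids of the documents containing it. The class `BooleanOps` and the integer ids are hypothetical; Lucene's BooleanQuery also scores the matches, which is not modelled here.

```java
import java.util.Set;
import java.util.TreeSet;

final class BooleanOps {
    // a AND b: documents present in both sets.
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // a OR b: documents present in either set.
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // a AND NOT b: documents in a that are not in b.
    static Set<Integer> andNot(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }
}
```

With webobjects in documents {1, 2, 3}, apple in {2, 3, 4} and iTunes in {3, 5}, the query (webobjects OR apple) AND iTunes corresponds to `and(or(webobjects, apple), itunes)` and matches only document 3.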
  35. Searching : QueryParser
      • title:"apple webobjects" - Phrase Query
      • title:"apple webobjects"~5 - slop of 5
      • webobj* - Prefix Query
      • webobjicts~ - Fuzzy Query
      • lastmodified:[1/1/10 TO 1/1/11] - Range Query
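
The prefix and fuzzy examples can be illustrated with two small term-matching helpers. Fuzzy matching in Lucene is based on edit distance, so the sketch computes the Levenshtein distance between two terms. The class `FuzzyMatch` and its method names are hypothetical, and how Lucene turns a distance into a similarity threshold is not modelled.

```java
final class FuzzyMatch {
    // Levenshtein distance: the minimum number of single-character insertions,
    // deletions and substitutions needed to turn a into b.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Prefix match, as in webobj*: the indexed term starts with the given prefix.
    static boolean prefixMatches(String term, String prefix) {
        return term.startsWith(prefix);
    }
}
```

The misspelled term webobjicts is one substitution away from webobjects, so `editDistance("webobjicts", "webobjects")` is 1 and a fuzzy query can still find the intended term.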
  36. Performing a Search
      Query query = ... // assume exists
      // indexDirectory is the Directory holding the index; true = open read-only
      IndexSearcher searcher = new IndexSearcher(indexDirectory, true);
      // Collect the 10 highest scoring hits
      TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
      searcher.search(query, collector);
      ScoreDoc[] hits = collector.topDocs().scoreDocs;
  37. Using a QueryParser
      // "content" is the default field searched when the query names none
      QueryParser queryParser = new QueryParser(Version.LUCENE_29, "content", analyzer);
      Query query = queryParser.parse(queryString);
  38. Demo
  39. Scoring
  40. “The more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query”
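
The principle quoted above can be sketched as a bare term frequency times inverse document frequency calculation. This is a simplified illustration under stated assumptions, not Lucene's actual scoring formula, which also involves normalization and boosts; the class `TfIdf` is hypothetical and documents are represented as pre-tokenized lists.

```java
import java.util.List;

final class TfIdf {
    // Scores one document for one term: occurrences in the document (tf)
    // weighted by how rare the term is across the collection (idf).
    static double score(String term, List<String> doc, List<List<String>> collection) {
        long tf = doc.stream().filter(term::equals).count();
        long df = collection.stream().filter(d -> d.contains(term)).count();
        if (tf == 0 || df == 0) {
            return 0.0; // the term does not occur, so it contributes nothing
        }
        double idf = Math.log((double) collection.size() / df);
        return tf * idf;
    }
}
```

A document that repeats a term scores higher than one that mentions it once, and a term found in only a few documents counts for more than one found in most of them.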
  41. Boost
      • While Indexing
          • Document
          • Field
      • While Searching
          • Query
  42. Luke
  43. Demo
  44. ERIndexing
  45. ERIndexing : Strengths
      • Hides some of the complexity of integrating Lucene with WO
      • Offers lots of utility and helper methods
      • Speaks WebObjects collection classes
      • Simplifies index creation
  46. ERIndexing : Weaknesses
      • Hides some of the complexity of integrating Lucene with WO
      • Not fully baked
      • Auto indexing may be dangerous
  47. Demo
  48. Beyond Lucene
      • Solr
      • Compass
      • ElasticSearch
  49. Q&A
      Lucene: http://lucene.apache.org
      Luke: http://code.google.com/p/luke/
      Solr: http://lucene.apache.org/solr/
      Compass: http://www.compass-project.org/overview.html
      ElasticSearch: http://www.elasticsearch.com/
