Illuminating Lucene.Net

An introduction to Lucene.Net delivered by W. Dean Thrasher to the Washington, DC Alt.Net meetup on May 14, 2013

Published in: Technology

  1. Illuminating Lucene.Net: Bringing Full-Text Search to Light
     W. Dean Thrasher
     14 May 2013
  2. Agenda
     • About the presenter
     • About Lucene.Net
       – What it is
       – What it does
       – How it works
       – Who uses it
       – Why you should care
  3. More Agenda
     • Core concepts
       – Lucene structure
       – Luke
       – Terminology
     • Code examples
     • Things to know
     • Recap
     • References
  4. W. Dean Thrasher
     Dean.thrasher@infovark.com
     www.infovark.com
     www.linkedin.com/in/deanthrasher
     @DThrasher
     @infovark
  5. BACKGROUND
  6. What is Lucene.Net?
     Lucene.Net is a port of the Lucene search engine library, written in C# and targeted at .NET runtime users.
  7. What is Lucene?
     Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
     Apache Lucene is an open source project available for free download.
  8. History
     1997 – Lucene project begun by Doug Cutting
     2000 – First open source release
     2002 – First Apache Jakarta release
     2005 – Lucene becomes a top-level project
     2006 – Lucene.Net gets Apache incubation status
     2010 – Lucene.Net orphaned by original committers
     2011 – Lucene.Net reaccepted into Apache Incubator
     2012 – Lucene.Net graduates from the Incubator
  9. Why you should care
     • You want to provide customers with a “Google-like” search experience
     • You want to tune incoming queries or results ranking
     • You want better performance than SQL “like” searches
     • You want to avoid deploying a separate search tool with your website or application
  10. What does it do?
      • Allows you to index and search vast amounts of text quickly
      • Provides a powerful query syntax
      • Integrates into applications easily
  11. How it works
      • Lucene uses an inverted index
        – Maps terms to the documents that contain them
      • Lucene manages its index
        – Stores the index in memory or on disk
        – Allows documents to be added or removed
      • Makes an index for each document
      • Merges the index with a set of other indices
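The inverted index described on the slide above can be sketched in a few lines of Python. This is a conceptual illustration only, not the Lucene.Net API: the function name `build_inverted_index`, the whitespace tokenization, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased term to the sorted ids of the documents containing it.

    Hypothetical helper for illustration; real Lucene indexes also store
    frequencies, positions, and much more.
    """
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: naive whitespace tokenization stands in for an analyzer.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "tomorrow is another day",
    2: "another day another dollar",
    3: "frankly my dear",
}
index = build_inverted_index(docs)
print(index["another"])
print(index["dear"])
```

Looking up a term is a single dictionary access, which is why this layout tends to answer text queries much faster than scanning every document.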
  12. Who uses Lucene.Net?
      • Stack Overflow
      • RavenDB
      • Sitecore
      • Orchard
      • MindTouch
      • Umbraco
      • Sitefinity
      • SubText
  13. CONCEPTS
  14. Differences between Java and .NET
      The Lucene.Net API:
      • Lags a few steps behind the Java version of Lucene
      • Takes advantage of advanced .NET features not found in Java
      But it:
      • Preserves the core Lucene concepts
      • Maintains indexes that are compatible with the Java version
  15. Logical Index Storage
      • Field – a name/value pair
      • Document – a sequence of fields
      • Index – a collection of documents
  16. Physical Index Storage
      • Lucene generates a series of files within a single directory
      • Moving an index is a copy-and-paste operation
      • You can compress or zip an index to archive it
  17. Luke
      • Lucene Index Toolbox
      • Built in Java, but can read Lucene.Net indexes
      • http://code.google.com/p/luke/
  18. Analyzers and Tokens
      • Analyzers take strings of text and break them into tokens
      • Tokens are chunks of text and associated metadata
  19. Terms, Queries and Hits
      • Terms – the basic unit for searching. A field name and a value to seek.
      • Queries – combine terms to form search criteria
      • Hits – a ranked list of pointers to documents
  20. CODE EXAMPLES
  21. Create documents demo
      • IndexWriter
      • Directory
      • Analyzer
      • Document
      • Field
  22. Read documents demo
      • IndexReader
      • Term
      • Query
      • Hits
  23. Update documents demo
      • IndexWriter
      • Document
      • Term
  24. Delete documents demo
      • IndexWriter
      • Query
      • Term
  25. Search demo
      • IndexSearcher
      • QueryParser
      • Query
      • Term
      • TopDocs
      • ScoreDoc
  26. THINGS TO KNOW
  27. Transactional Lucene
      • Lucene supports ACID commits to its indexes
      • Lucene uses the Commit and Rollback syntax, much like relational databases.
      • Source: http://blog.mikemccandless.com/2012/03/transactional-lucene.html
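To make the commit and rollback idea concrete, here is a toy Python model of the visibility rule only: changes stay pending until they are committed, and a rollback discards them. The class `TransactionalIndex` and its methods are hypothetical names for this sketch. It does not model durability or crash safety, and it is not how Lucene.Net implements its writer.

```python
class TransactionalIndex:
    """Toy index: pending changes are invisible to readers until commit()."""

    def __init__(self):
        self._committed = {}  # doc_id -> text that searches can see
        self._pending = {}    # doc_id -> new text, or None for a pending delete

    def add(self, doc_id, text):
        self._pending[doc_id] = text

    def delete(self, doc_id):
        self._pending[doc_id] = None

    def commit(self):
        # Apply every pending change, then clear the buffer.
        for doc_id, text in self._pending.items():
            if text is None:
                self._committed.pop(doc_id, None)
            else:
                self._committed[doc_id] = text
        self._pending.clear()

    def rollback(self):
        # Discard everything since the last commit.
        self._pending.clear()

    def visible_ids(self):
        return sorted(self._committed)


idx = TransactionalIndex()
idx.add(1, "gone with the wind")
idx.commit()

idx.add(2, "moby dick")
idx.rollback()            # document 2 is never visible

idx.delete(1)
before_commit = idx.visible_ids()   # the delete is still pending here
idx.commit()
after_commit = idx.visible_ids()
```

The point of the sketch is that readers only ever see the last committed state, which is the property the slide compares to relational databases.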
  28. Lucene index types
      FSDirectory (your first choice)
      • Stores indexed documents on disk
      • Persists data across sessions
      • Best choice for most applications
      RAMDirectory (useful for unit testing)
      • Stores indexed documents in memory
      • Entire index must fit into available memory
      • Does not persist data
      • Faster than FSDirectory
  29. Precalculation
      • How you store things in Lucene matters – choose field options and analyzers carefully
      • The way you retrieve information determines how it should be stored
      • Smaller indexes give you better performance
  30. Field.Store
      • Yes – stores the text in its original form
      • No – the original text is not preserved
  31. Field.Index
      • No – the field is not indexed, so it is not searchable
      • Not analyzed – the text is treated as a single unit and indexed whole
      • Analyzed – the text is broken down into tokens and indexed
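The three Field.Index options differ in which terms end up in the index. The Python sketch below illustrates that difference under simple assumptions: the function `index_terms` and its mode strings are hypothetical, and lowercasing plus whitespace splitting stands in for whatever analyzer is actually configured.

```python
def index_terms(value, mode):
    """Return the searchable terms a field value would contribute.

    Hypothetical modes mirroring the slide: "no", "not_analyzed", "analyzed".
    """
    if mode == "no":
        return []                      # nothing indexed, so nothing to search
    if mode == "not_analyzed":
        return [value]                 # the whole value is one exact term
    if mode == "analyzed":
        return value.lower().split()   # assumption: a very simple analyzer
    raise ValueError(f"unknown mode: {mode}")


title = "Gone With The Wind"
print(index_terms(title, "no"))
print(index_terms(title, "not_analyzed"))
print(index_terms(title, "analyzed"))
```

This is why identifiers are usually indexed whole (an exact match on the entire value) while body text is analyzed (a match on any individual word).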
  32. Field.TermVector
      • No – Does not store term vectors
      • Yes – Stores the term vectors of each document (terms and number of occurrences)
      • With Positions Offsets – Term vector, token position and offset information
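As a rough picture of what a term vector holds, the Python sketch below computes both forms for one piece of text: plain term counts, and counts enriched with token positions and character offsets. The function names and the regex-based tokenization are assumptions for illustration, not Lucene.Net's storage format.

```python
import re
from collections import Counter

WORD = re.compile(r"\w+")  # assumption: a token is a run of word characters


def term_vector(text):
    """Terms and their number of occurrences in one document."""
    return Counter(WORD.findall(text.lower()))


def term_vector_with_positions_offsets(text):
    """Map each term to a list of (position, start_offset, end_offset)."""
    vector = {}
    for position, match in enumerate(WORD.finditer(text.lower())):
        entry = (position, match.start(), match.end())
        vector.setdefault(match.group(), []).append(entry)
    return vector


text = "Another day, another dollar"
counts = term_vector(text)
detailed = term_vector_with_positions_offsets(text)
print(counts["another"])
print(detailed["another"])
```

Positions and offsets cost extra index space, which ties back to the earlier advice that smaller indexes perform better: store them only for fields that need features such as highlighting.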
  33. Field types indexing options
      Field      Stored   Analyzed       Vectored
      Id         Yes      Not analyzed   No
      Modified   Yes      Not analyzed   No
      Path       Yes      Analyzed       No
      Content    No       Analyzed       With Positions Offsets
      An example of storing fields related to files on your computer.
  34. Analyzers
      • Break apart text into tokens; each token gets indexed separately
      • Remove stop words
      • Decide how to handle punctuation
      • Handle languages and case sensitivity
      • You can create your own by building from scratch or chaining existing analyzers
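The chaining idea on the slide above can be sketched as a tokenizer followed by a series of filters. Everything in this Python example is an assumption made for illustration: the stop-word list, the function names, and the punctuation rule are not taken from any Lucene.Net analyzer.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "of", "my"}  # hypothetical stop list


def tokenize(text):
    # Assumption: keep letters, digits and apostrophes; drop other punctuation.
    return re.findall(r"[A-Za-z0-9']+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def make_analyzer(tokenizer, *filters):
    """Build an analyzer by chaining a tokenizer with zero or more filters."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


analyze = make_analyzer(tokenize, lowercase, remove_stop_words)
tokens = analyze("Frankly, my dear, I don't give a damn.")
print(tokens)
```

Filter order matters in this sketch: lowercasing runs before stop-word removal so that a capitalized "The" would still be caught by the lowercase stop list.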
  35. Types of Queries
      • TermQuery
      • PhraseQuery
      • RangeQuery
      • PrefixQuery, WildcardQuery
      • FuzzyQuery
      • Use BooleanQuery to combine them
  36. Query syntax
      Query Type      Purpose                                           Sample
      TermQuery       Single word query                                 scarlett
      PhraseQuery     Matches terms in order                            “frankly my dear”
      RangeQuery      Matches documents between the terms               [1861 TO 1865], {1861 TO 1865}
      WildcardQuery   Lightweight regex-like term matching              Atl*, D?m?
      PrefixQuery     Matches terms that begin with the string          War*
      FuzzyQuery      Closeness matching                                cry~
      BooleanQuery    Combines other queries into complex expressions   Scarlett AND “frankly my dear” -voldemort
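To give a feel for what three of these query types test, here is a Python sketch of wildcard, range, and fuzzy matching against single terms. The helper names are hypothetical. Two simplifications are worth flagging: `fnmatchcase` also accepts `[...]` character classes that the slide's wildcard syntax does not mention, and the fuzzy check uses a fixed edit budget rather than Lucene's own similarity scoring. In the range sample, square brackets denote an inclusive range and curly braces an exclusive one.

```python
from fnmatch import fnmatchcase


def wildcard_match(pattern, term):
    """'*' matches any run of characters, '?' matches exactly one."""
    return fnmatchcase(term, pattern)


def in_range(term, low, high, inclusive=True):
    """Range test on terms; assumption: terms are compared as strings."""
    if inclusive:
        return low <= term <= high
    return low < term < high


def edit_distance(a, b):
    """Levenshtein distance using a rolling row of the DP table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_match(query, term, max_edits=1):
    # Assumption: a fixed edit budget stands in for a similarity threshold.
    return edit_distance(query, term) <= max_edits


print(wildcard_match("Atl*", "Atlanta"))
print(in_range("1861", "1861", "1865", inclusive=False))
print(fuzzy_match("cry", "dry"))
```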
  37. Query, Filter, and Sort
      • Lucene.Net can handle all three
      • Default sort is by relevance
      • Prefer queries to filters – they perform better
  38. Using Dispose()
  39. Linq Providers
      • LINQ to Lucene
        – http://linqtolucene.codeplex.com/
      • Lucene.Net.Linq
        – https://github.com/themotleyfool/Lucene.Net.Linq
        – Chris Eldredge
        – MotleyFool
  40. Recap
      • Why would I use a search engine?
      • Why would I use Lucene.Net?
      • How would I add Lucene.Net to my project?
        – Web
        – Desktop
      • Where could I go to learn more?
      • When can I buy Dean a beer?
  41. REFERENCES
  42. Web References
      • Lucene.Net – http://lucenenet.apache.org
      • Solr – http://lucene.apache.org/solr
      • Wikipedia
        – http://en.wikipedia.org/wiki/Lucene
        – http://en.wikipedia.org/wiki/Search_engine_indexing
      • Academic discussions
        – http://lucene.sourceforge.net/talks/pisa/
        – http://lucene.sourceforge.net/talks/inktomi/
  43. Books
      • Lucene in Action, Second Edition
      • Michael McCandless, Erik Hatcher, Otis Gospodnetić
      • Manning Publications
      • July 2010
      • http://www.manning.com/hatcher3/
  44. Books
      • Taming Text
      • Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris
      • Manning Publications
      • January 2013
      • http://www.manning.com/ingersoll/
  45. Books
      • Introduction to Information Retrieval
      • Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
      • Cambridge University Press
      • 2008
      • http://www-nlp.stanford.edu/IR-book/
  46. Presentations
      • http://www.slideshare.net/nitin_stephens/lucene-basics
  47. Blogs
      • http://blog.mikemccandless.com/
  48. Sample Files
      All the literature shown in the code samples comes from Project Gutenberg.
      http://www.gutenberg.org/
