Illuminating Lucene.Net: Bringing Full-Text Search to Light
W. Dean Thrasher, 14 May 2013
An introduction to Lucene.Net delivered by W. Dean Thrasher to the Washington, DC Alt.Net meetup on May 14, 2013

Speaker notes
  • Egad, the PUNishment! Well, at least I didn’t have a boring “Introduction to Lucene.NET” title.
  • Oooh, an agenda. Aren’t I organized?
  • Please send me an email to get in touch with me. Keep up with what I’m doing on the Infovark website or on my LinkedIn profile. I’ve listed my twitter handles – personal and work – but I rarely log into Twitter for any length of time. Send me a private message if you want to get my attention on Twitter.
  • Doug Cutting had written search engines in other languages, but he wanted to teach himself Java. So the Lucene project began. Although he started building a commercial venture around the project, he decided that he preferred writing code to running a business. He open sourced the code in 2000. Lucene was adopted by the Apache Software Foundation in 2001. Lucene.Net, which began as an independent port of Lucene, was accepted by the ASF in 2006. In 2010, Lucene.Net hit a rough patch, but thanks to the efforts of the Alt.Net community, it was reintroduced to the Apache Incubator. In 2012, it graduated from the Incubator and became a full-fledged Apache project.
  • Inverted index: maps terms to the documents that contain them. Terms may include metadata to improve ranking. Terms may include position data for proximity searches.
  • These are a few examples of websites, applications, and platforms that use Lucene.Net. If I included those that use Lucene, the Java version, the list would be huge. Even if you don’t use Lucene.Net directly, chances are good that you use something that does. Lucene has become a foundational technology for many of the tools and sites we use today, but not many folks working on the Microsoft side are familiar with it. Some prominent Java examples include LinkedIn, Twitter, IBM’s OmniFind, and many more.
  • The .Net version is catching up with the Java version, but it remains nearly a full version behind. The .Net API is much nicer to work with, having good collections and generics support. Tools that interact with a Lucene index will work regardless of the Lucene library that created it.
  • Although we’ll be working with the Lucene.NET API tonight, many of the concepts you’ll hear will apply to any search engine, though the specific terminology may differ a little. Let’s review some basic definitions we’ll use throughout the rest of the presentation. Index – a collection of documents. Document – a sequence of fields. Field – a string name/value pair.
  • Luke is one of the ugliest applications I’ve ever seen, but it’s extremely useful. It exposes just about every aspect of the Lucene API, so it makes a great test-bed for trying out different ideas.
  • Analyzer – breaks field values into tokens. Token – a tuple consisting of a chunk of text and its associated metadata. Tokens are the raw bits that get indexed. (Tokens and terms are closely related.)
  • Query – a way to ask a question of an index. Term – a tuple containing a field and a value to seek.
  • Here are some of the key classes used to add documents to the index. I really ought to add some details to the slide for folks who can’t see the code sample.
  • Updating is a fairly new operation in the Lucene.Net API. Under the hood, it’s doing a Delete operation then an Add operation.
  • Did you know that you can use an IndexReader to update and delete documents, too? Yes, but I don’t recommend it. This is one of the parts of the API that’s getting revised in the near future.
  • Unlike a relational database, there’s no “normal form” to guide you when structuring a Lucene index. The key thing to remember is that the
  • Keeping the original text within the Lucene index is convenient, but can vastly increase the size of your indexes.
  • Term Vector Yes
  • Just an example of how you might combine the flags when adding fields to a document.
  • TermQuery – retrieves documents by a key. PrefixQuery – matches the start of a string value. RangeQuery – searches starting at one term and ending at another (useful for date searches). BooleanQuery – lets you combine other queries using AND, OR, and NOT operations. PhraseQuery – finds terms a specified distance from one another. FuzzyQuery – matches terms similar to a specified term.
  • Examples of query syntax.
  • Some odds and ends on Queries, filters and sorting.
  • We can finally dispose of our Lucene objects in versions 2.9.4 and later. If you’re using older versions, you must remember to try/finally the FSDirectory and IndexWriter. Remember that it’s much more efficient to add a bunch of documents within a single using statement than to open a new IndexWriter each time.

    1. Illuminating Lucene.Net: Bringing Full-Text Search to Light
       W. Dean Thrasher
       14 May 2013
    2. Agenda
       • About the presenter
       • About Lucene.Net
         – What it is
         – What it does
         – How it works
         – Who uses it
         – Why you should care
    3. More Agenda
       • Core concepts
         – Lucene structure
         – Luke
         – Terminology
       • Code examples
       • Things to know
       • Recap
       • References
    4. W. Dean Thrasher
       Dean.thrasher@infovark.com
       www.infovark.com
       www.linkedin.com/in/deanthrasher
       @DThrasher
       @infovark
    5. BACKGROUND
       Illuminating Lucene.Net
    6. What is Lucene.Net?
       Lucene.Net is a port of the Lucene search engine library, written in C# and targeted at .NET runtime users.
    7. What is Lucene?
       Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
       Apache Lucene is an open source project available for free download.
    8. History
       1997 – Lucene project begun by Doug Cutting
       2000 – First open source release
       2002 – First Apache Jakarta release
       2005 – Lucene becomes a top-level project
       2006 – Lucene.Net gets Apache incubation status
       2010 – Lucene.Net orphaned by original committers
       2011 – Lucene.Net reaccepted into Apache Incubator
       2012 – Lucene.Net graduates from the Incubator
    9. Why you should care
       • You want to provide customers with a “Google-like” search experience
       • You want to tune incoming queries or results ranking
       • You want better performance than SQL “like” searches
       • You want to avoid deploying a separate search tool with your website or application
    10. What does it do?
        • Allows you to index and search vast amounts of text quickly
        • Provides a powerful query syntax
        • Integrates into applications easily
    11. How it works
        • Lucene uses an inverted index
          – Maps terms to the documents that contain them
        • Lucene manages its index
          – Stores the index in memory or on disk
          – Allows documents to be added or removed
        • Makes an index for each document
        • Merges the index with a set of other indices
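The inverted index on slide 11 can be pictured with a few lines of plain Python. This is only a sketch of the idea, not Lucene.Net code: the function name `build_inverted_index` and the sample documents are made up for illustration, and a real Lucene index keeps far more metadata (frequencies, positions) in its segment files.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis: lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "tomorrow is another day",
    2: "another day another dollar",
    3: "frankly my dear",
}
index = build_inverted_index(docs)
print(index["another"])
print(index["dear"])
```

Looking up a term is then a single dictionary access, which is why searching stays fast even as the number of documents grows.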
    12. Who uses Lucene.Net?
        • Stack Overflow
        • RavenDB
        • Sitecore
        • Orchard
        • MindTouch
        • Umbraco
        • Sitefinity
        • SubText
    13. CONCEPTS
        Illuminating Lucene.Net
    14. Differences between Java and .Net
        The Lucene.Net API:
        • Lags a few steps behind the Java version of Lucene
        • Takes advantage of advanced .Net features not found in Java
        But it:
        • Preserves the core Lucene concepts
        • Maintains indexes that are compatible with the Java version
    15. Logical Index Storage
        • Field – a name/value pair
        • Document – a sequence of fields
        • Index – a collection of documents
    16. Physical Index Storage
        • Lucene generates a series of files within a single directory
        • Moving an index is a copy-and-paste operation
        • You can compress or zip an index to archive it
    17. Luke
        • Lucene Index Toolbox
        • Built in Java, but can read Lucene.Net indexes
        • http://code.google.com/p/luke/
    18. Analyzers and Tokens
        • Analyzers take strings of text and break them into tokens
        • Tokens are chunks of text and associated metadata
    19. Terms, Queries and Hits
        • Terms – the basic unit for searching. A field name and a value to seek.
        • Queries – combine terms to form search criteria
        • Hits – a ranked list of pointers to documents
    20. CODE EXAMPLES
    21. Create documents demo
        • IndexWriter
        • Directory
        • Analyzer
        • Document
        • Field
    22. Read documents demo
        • IndexReader
        • Term
        • Query
        • Hits
    23. Update documents demo
        • IndexWriter
        • Document
        • Term
    24. Delete documents demo
        • IndexWriter
        • Query
        • Term
    25. Search demo
        • IndexSearcher
        • QueryParser
        • Query
        • Term
        • TopDocs
        • ScoreDoc
    26. THINGS TO KNOW
        Illuminating Lucene.Net
    27. Transactional Lucene
        • Lucene supports ACID commits to its indexes
        • Lucene uses the Commit and Rollback syntax, much like relational databases.
        • Source: http://blog.mikemccandless.com/2012/03/transactional-lucene.html
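The commit and rollback behaviour on slide 27 can be illustrated with a toy in-memory store. Everything here is hypothetical: `TinyIndex` and its methods are invented for this sketch and are not the Lucene.Net `IndexWriter` API. The sketch assumes that uncommitted changes stay invisible to searches, and it models an update as a delete followed by an add, as the speaker notes describe.

```python
class TinyIndex:
    """Hypothetical toy store showing commit/rollback; not Lucene.Net."""

    def __init__(self):
        self._committed = {}  # doc_id -> text, the view that searches see
        self._working = {}    # uncommitted view that edits are applied to

    def add(self, doc_id, text):
        self._working[doc_id] = text

    def delete(self, doc_id):
        self._working.pop(doc_id, None)

    def update(self, doc_id, text):
        # An update is a delete followed by an add.
        self.delete(doc_id)
        self.add(doc_id, text)

    def commit(self):
        # Publish all pending changes at once.
        self._committed = dict(self._working)

    def rollback(self):
        # Discard everything changed since the last commit.
        self._working = dict(self._committed)

    def search(self, word):
        return sorted(
            doc_id
            for doc_id, text in self._committed.items()
            if word in text.lower().split()
        )


index = TinyIndex()
index.add(1, "Gone with the Wind")
print(index.search("wind"))  # nothing is visible before the commit
index.commit()
print(index.search("wind"))  # the document is visible now
index.update(1, "War and Peace")
index.rollback()             # the update is thrown away
print(index.search("wind"))
```

Batching many changes between commits mirrors the advice in the notes about reusing one writer for a bunch of documents rather than opening a new one each time.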
    28. Lucene index types
        FSDirectory
        • Stores indexed documents on disk
        • Persists data across sessions
        • Best choice for most applications
        Your first choice
        RAMDirectory
        • Stores indexed documents in memory
        • Entire index must fit into available memory
        • Does not persist data
        • Faster than FSDirectory
        Useful for unit testing
    29. Precalculation
        • How you store things in Lucene matters – choose field options and analyzers carefully
        • The way you retrieve information determines how it should be stored
        • Smaller indexes give you better performance
    30. Field.Store
        • Yes – stores the text in its original form
        • No – the original text is not preserved
    31. Field.Index
        • No – the field is not indexed, so it is not searchable
        • Not analyzed – the text is treated as single unit and indexed whole
        • Analyzed – the text is broken down into tokens and indexed
    32. Field.TermVector
        • No – Does not store term vectors
        • Yes – Stores the term vectors of each document (terms and number of occurrences)
        • With Positions Offsets – Term vector, token position and offset information
    33. Field types indexing options
        Field     Stored  Analyzed      Vectored
        Id        Yes     Not analyzed  No
        Modified  Yes     Not analyzed  No
        Path      Yes     Analyzed      No
        Content   No      Analyzed      With Positions Offsets
        An example of storing fields related to files on your computer.
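To make the difference between storing and indexing a field concrete, here is a small Python sketch loosely based on the table on slide 33. It is not Lucene.Net code: `FIELD_OPTIONS`, `index_document`, and the sample values are assumptions for illustration, and the "analysis" is just lowercasing and splitting on whitespace.

```python
# Hypothetical per-field options: (stored?, index mode).
FIELD_OPTIONS = {
    "id": (True, "not_analyzed"),
    "modified": (True, "not_analyzed"),
    "content": (False, "analyzed"),
}


def index_document(doc):
    """Return (stored_fields, searchable_terms) for one document."""
    stored = {}
    terms = set()  # (field name, term) pairs that a search could match
    for field, value in doc.items():
        is_stored, mode = FIELD_OPTIONS[field]
        if is_stored:
            # Stored: the original text can be read back from a hit.
            stored[field] = value
        if mode == "not_analyzed":
            # Indexed whole, exactly as given.
            terms.add((field, value))
        elif mode == "analyzed":
            # Broken into tokens; each token is indexed separately.
            for word in value.lower().split():
                terms.add((field, word))
    return stored, terms


stored, terms = index_document(
    {"id": "File-42", "modified": "2013-05-14", "content": "Tomorrow is another day"}
)
print(stored)
print(sorted(terms))
```

The `content` field is searchable word by word but cannot be read back, while `id` can be read back and matched only as the exact string `File-42`.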
    34. Analyzers
        • Break apart text into tokens; each token gets indexed separately
        • Remove stop words
        • Decide how to handle punctuation
        • Handle languages and case sensitivity
        • You can create your own by building from scratch or chaining existing analyzers
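The analyzer duties listed on slide 34 can be sketched in a few lines of Python. This is a simplified stand-in, not one of the Lucene.Net analyzers: the `analyze` function and the stop word list are invented for illustration, and the regular expression only recognises ASCII letters and digits.

```python
import re

# Illustrative stop word list only; real analyzers ship their own lists.
STOP_WORDS = {"a", "an", "and", "is", "my", "the", "to"}


def analyze(text):
    """Return (term, position) tokens for a piece of text."""
    tokens = []
    # Punctuation is dropped because only runs of letters and digits match.
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        term = match.group().lower()  # case-insensitive matching
        if term not in STOP_WORDS:
            # The position is an example of the metadata a token carries.
            tokens.append((term, position))
    return tokens


print(analyze("Tomorrow is another day!"))
# [('tomorrow', 0), ('another', 2), ('day', 3)]
```

The same analysis has to be applied to queries as to indexed text, otherwise a search for `Tomorrow` would never match the lowercased term in the index.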
    35. Types of Queries
        • TermQuery
        • PhraseQuery
        • RangeQuery
        • PrefixQuery, WildcardQuery
        • FuzzyQuery
        • Use BooleanQuery to combine them
    36. Query syntax
        Query Type     Purpose                                          Sample
        TermQuery      Single word query                                scarlett
        PhraseQuery    Matches terms in order                           “frankly my dear”
        RangeQuery     Matches documents between the terms              [1861 TO 1865]  {1861 TO 1865}
        WildcardQuery  Lightweight regex-like term matching             Atl*  D?m?
        PrefixQuery    Matches terms that begin with the string         War*
        FuzzyQuery     Closeness matching                               cry~
        BooleanQuery   Combines other queries into complex expressions  Scarlett AND “frankly my dear” -voldemort
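As a rough picture of how some of these query types behave, the Python sketch below evaluates term, prefix, and wildcard lookups against a tiny hand-written inverted index and combines the results with set operations. It is an assumption-laden illustration, not the Lucene.Net query classes: the `INDEX` contents and function names are hypothetical, and no relevance ranking is attempted.

```python
from fnmatch import fnmatchcase

# Hypothetical inverted index: term -> ids of documents containing it.
INDEX = {
    "atlanta": {1, 2},
    "dear": {3},
    "frankly": {3},
    "scarlett": {1, 3},
    "war": {2},
    "warning": {4},
}


def term_query(term):
    """Documents that contain exactly this term."""
    return set(INDEX.get(term, set()))


def prefix_query(prefix):
    """Documents containing any term that begins with the prefix."""
    hits = set()
    for term, ids in INDEX.items():
        if term.startswith(prefix):
            hits |= ids
    return hits


def wildcard_query(pattern):
    """Documents containing any term matching a * / ? pattern."""
    hits = set()
    for term, ids in INDEX.items():
        if fnmatchcase(term, pattern):
            hits |= ids
    return hits


# Boolean combinations map onto set operations.
both = term_query("scarlett") & term_query("frankly")     # AND
either = term_query("war") | term_query("dear")           # OR
without = term_query("scarlett") - term_query("frankly")  # NOT

print(sorted(both), sorted(either), sorted(without))
print(sorted(prefix_query("war")))
print(sorted(wildcard_query("atl*")), sorted(wildcard_query("d?ar")))
```

Prefix and wildcard lookups have to scan the term list rather than jump to a single entry, which hints at why they tend to cost more than a plain term lookup.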
    37. Query, Filter, and Sort
        • Lucene.Net can handle all three
        • Default sort is by relevance
        • Prefer queries to filters – they perform better
    38. Using Dispose()
    39. Linq Providers
        • LINQ to Lucene
          – http://linqtolucene.codeplex.com/
        • Lucene.Net.Linq
          – https://github.com/themotleyfool/Lucene.Net.Linq
          – Chris Eldredge
          – MotleyFool
    40. Recap
        • Why would I use a search engine?
        • Why would I use Lucene.Net?
        • How would I add Lucene.Net to my project?
          – Web
          – Desktop
        • Where could I go to learn more?
        • When can I buy Dean a beer?
    41. REFERENCES
        Illuminating Lucene.Net
    42. Web References
        • Lucene.Net – http://lucenenet.apache.org
        • Solr – http://lucene.apache.org/solr
        • Wikipedia
          – http://en.wikipedia.org/wiki/Lucene
          – http://en.wikipedia.org/wiki/Search_engine_indexing
        • Academic discussions
          – http://lucene.sourceforge.net/talks/pisa/
          – http://lucene.sourceforge.net/talks/inktomi/
    43. Books
        • Lucene in Action, Second Edition
        • Michael McCandless, Erik Hatcher, Otis Gospodnetić
        • Manning Publications
        • July 2010
        • http://www.manning.com/hatcher3/
    44. Books
        • Taming Text
        • Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris
        • Manning Publications
        • January 2013
        • http://www.manning.com/ingersoll/
    45. Books
        • Introduction to Information Retrieval
        • Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
        • Cambridge University Press
        • 2008
        • http://www-nlp.stanford.edu/IR-book/
    46. Presentations
        • http://www.slideshare.net/nitin_stephens/lucene-basics
    47. Blogs
        • http://blog.mikemccandless.com/
    48. Sample Files
        All the literature shown in the code samples comes from Project Gutenberg.
        http://www.gutenberg.org/