Illuminating Lucene.Net
 


An introduction to Lucene.Net delivered by W. Dean Thrasher to the Washington, DC Alt.Net meetup on May 14, 2013


Usage Rights

CC Attribution License

  • Egad, the PUNishment! Well, at least I didn’t have a boring “Introduction to Lucene.NET” title.
  • Oooh, an agenda. Aren’t I organized?
  • Please send me an email to get in touch with me. Keep up with what I’m doing on the Infovark website or on my LinkedIn profile. I’ve listed my twitter handles – personal and work – but I rarely log into Twitter for any length of time. Send me a private message if you want to get my attention on Twitter.
  • Doug Cutting had written search engines in other languages, but he wanted to teach himself Java. So the Lucene project began. Although he started building a commercial venture around the project, he decided that he preferred writing code to running a business. He open sourced the code in 2000. Lucene was adopted by the Apache Software Foundation in 2001. Lucene.Net, which began as an independent port of Lucene, was accepted by the ASF in 2006. In 2010, Lucene.Net hit a rough patch, but thanks to the efforts of the Alt.Net community, it was reintroduced to the Apache Incubator. In 2012, it graduated from the Incubator and became a full-fledged Apache project.
  • Inverted index: maps terms to the documents that contain them. Terms may include metadata to improve ranking. Terms may include position data for proximity searches.
  • These are a few examples of websites, applications, and platforms that use Lucene.Net. If I included those that use Lucene, the Java version, the list would be huge. Even if you don’t use Lucene.Net directly, chances are good that you use something that does. Lucene has become a foundational technology for many of the tools and sites we use today, but not many folks working on the Microsoft side are familiar with it. Some prominent Java examples include LinkedIn, Twitter, IBM’s OmniFind, and many more.
  • The .Net version is catching up with the Java version, but it remains nearly a full version behind. The .Net API is much nicer to work with, having good collections and generics support. Tools that interact with a Lucene index will work regardless of the Lucene library that created it.
  • Although we’ll be working with the Lucene.NET API tonight, many of the concepts you’ll hear will apply to any search engine, though the specific terminology may differ a little. Let’s review some basic definitions we’ll use throughout the rest of the presentation. Index – a collection of documents. Document – a sequence of fields. Field – a string name/value pair.
  • Luke is one of the ugliest applications I’ve ever seen, but it’s extremely useful. It exposes just about every aspect of the Lucene API, so it makes a great test-bed for trying out different ideas.
  • Analyzer – breaks field values into tokens. Token – a tuple consisting of a chunk of text and its associated metadata. Tokens are the raw bits that get indexed. (Tokens and terms are closely related.)
  • Query – a way to ask a question of an index. Term – a tuple containing a field and a value to seek.
  • Here are some of the key classes used to add documents to the index. I really ought to add some details to the slide for folks who can’t see the code sample.
  • Updating is a fairly new operation in the Lucene.Net API. Under the hood, it’s doing a Delete operation then an Add operation.
  • Did you know that you can use an IndexReader to update and delete documents, too? Yes, but I don’t recommend it. This is one of the parts of the API that’s getting revised in the near future.
  • Unlike a relational database, there’s no “normal form” to guide you when structuring a Lucene index. The key thing to remember is that the
  • Keeping the original text within the Lucene index is convenient, but can vastly increase the size of your indexes.
  • Term Vector Yes
  • Just an example of how you might combine the flags when adding fields to a document.
  • TermQuery – retrieves documents by a key. PrefixQuery – matches the start of a string value. RangeQuery – searches starting at one term and ending at another (useful for date searches). BooleanQuery – lets you combine other queries using AND, OR, NOT operations. PhraseQuery – finds terms within a specified distance of one another. FuzzyQuery – matches terms similar to a specified term.
  • Examples of query syntax.
  • Some odds and ends on Queries, filters and sorting.
  • We can finally dispose of our Lucene objects in versions 2.9.4 and later. If you’re using older versions, you must remember to try/finally the FSDirectory and IndexWriter. Remember that it’s much more efficient to add a bunch of documents within a single using statement than to open a new IndexWriter each time.

Illuminating Lucene.Net Presentation Transcript

  • Illuminating Lucene.Net: Bringing Full-Text Search to Light
    W. Dean Thrasher
    14 May 2013
  • Agenda
    • About the presenter
    • About Lucene.Net
      – What it is
      – What it does
      – How it works
      – Who uses it
      – Why you should care
  • More Agenda
    • Core concepts
      – Lucene structure
      – Luke
      – Terminology
    • Code examples
    • Things to know
    • Recap
    • References
  • W. Dean Thrasher
    Dean.thrasher@infovark.com
    www.infovark.com
    www.linkedin.com/in/deanthrasher
    @DThrasher
    @infovark
  • BACKGROUND: Illuminating Lucene.Net
  • What is Lucene.Net?
    Lucene.Net is a port of the Lucene search engine library, written in C# and targeted at .NET runtime users.
  • What is Lucene?
    Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
    Apache Lucene is an open source project available for free download.
  • History
    • 1997 – Lucene project begun by Doug Cutting
    • 2000 – First open source release
    • 2002 – First Apache Jakarta release
    • 2005 – Lucene becomes a top-level project
    • 2006 – Lucene.Net gets Apache incubation status
    • 2010 – Lucene.Net orphaned by original committers
    • 2011 – Lucene.Net reaccepted into Apache Incubator
    • 2012 – Lucene.Net graduates from the Incubator
  • Why you should care
    • You want to provide customers with a “Google-like” search experience
    • You want to tune incoming queries or results ranking
    • You want better performance than SQL “like” searches
    • You want to avoid deploying a separate search tool with your website or application
  • What does it do?
    • Allows you to index and search vast amounts of text quickly
    • Provides a powerful query syntax
    • Integrates into applications easily
  • How it works
    • Lucene uses an inverted index
      – Maps terms to the documents that contain them
    • Lucene manages its index
      – Stores the index in memory or on disk
      – Allows documents to be added or removed
    • Makes an index for each document
    • Merges the index with a set of other indices
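The inverted-index idea on this slide can be illustrated with a toy sketch. This is plain Python using only the standard library, not the Lucene.Net API; the sample documents, the `tokenize` helper, and the `build_inverted_index` function are all hypothetical names chosen for illustration.

```python
from collections import defaultdict

def tokenize(text):
    """Naive stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def build_inverted_index(documents):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical sample documents, keyed by document id.
documents = {
    1: "frankly my dear",
    2: "my dear Watson",
    3: "call me Ishmael",
}

index = build_inverted_index(documents)
print(index["dear"])      # ids of the documents that contain "dear"
print(index["ishmael"])   # terms are stored lowercased
```

Looking up a term is a single dictionary access, which is why this structure answers "which documents contain this word?" without scanning every document. A real index also handles ranking, on-disk storage, and merging, none of which this sketch attempts.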
  • Who uses Lucene.Net?
    • Stackoverflow
    • RavenDB
    • Sitecore
    • Orchard
    • MindTouch
    • Umbraco
    • Sitefinity
    • SubText
  • CONCEPTS: Illuminating Lucene.Net
  • Differences between Java and .Net
    The Lucene.Net API:
    • Lags a few steps behind the Java version of Lucene
    • Takes advantage of advanced .Net features not found in Java
    But it:
    • Preserves the core Lucene concepts
    • Maintains indexes that are compatible with the Java version
  • Logical Index Storage
    • Field – a name/value pair
    • Document – a sequence of fields
    • Index – a collection of documents
  • Physical Index Storage
    • Lucene generates a series of files within a single directory
    • Moving an index is a copy-and-paste operation
    • You can compress or zip an index to archive it
  • Luke
    • Lucene Index Toolbox
    • Built in Java, but can read Lucene.Net indexes
    • http://code.google.com/p/luke/
  • Analyzers and Tokens
    • Analyzers take strings of text and break them into tokens
    • Tokens are chunks of text and associated metadata
  • Terms, Queries and Hits
    • Terms – the basic unit for searching. A field name and a value to seek.
    • Queries – combine terms to form search criteria
    • Hits – a ranked list of pointers to documents
  • CODE EXAMPLES
  • Create documents demo
    • IndexWriter
    • Directory
    • Analyzer
    • Document
    • Field
  • Read documents demo
    • IndexReader
    • Term
    • Query
    • Hits
  • Update documents demo
    • IndexWriter
    • Document
    • Term
  • Delete documents demo
    • IndexWriter
    • Query
    • Term
  • Search demo
    • IndexSearcher
    • QueryParser
    • Query
    • Term
    • TopDocs
    • ScoreDoc
  • THINGS TO KNOW: Illuminating Lucene.Net
  • Transactional Lucene
    • Lucene supports ACID commits to its indexes
    • Lucene uses the Commit and Rollback syntax, much like relational databases.
    • Source: http://blog.mikemccandless.com/2012/03/transactional-lucene.html
  • Lucene index types
    • FSDirectory (your first choice)
      – Stores indexed documents on disk
      – Persists data across sessions
      – Best choice for most applications
    • RAMDirectory (useful for unit testing)
      – Stores indexed documents in memory
      – Entire index must fit into available memory
      – Does not persist data
      – Faster than FSDirectory
  • Precalculation
    • How you store things in Lucene matters – choose field options and analyzers carefully
    • The way you retrieve information determines how it should be stored
    • Smaller indexes give you better performance
  • Field.Store
    • Yes – stores the text in its original form
    • No – the original text is not preserved
  • Field.Index
    • No – the field is not indexed, so it is not searchable
    • Not analyzed – the text is treated as single unit and indexed whole
    • Analyzed – the text is broken down into tokens and indexed
  • Field.TermVector
    • No – Does not store term vectors
    • Yes – Stores the term vectors of each document (terms and number of occurrences)
    • With Positions Offsets – Term vector, token position and offset information
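To make the term vector options concrete, here is a rough sketch of the information they describe: per-term frequency, token positions, and character offsets. It is a standard-library Python illustration under simplifying assumptions (a regex tokenizer and lowercasing), not how Lucene.Net stores term vectors internally; `term_vector` is a hypothetical helper.

```python
import re
from collections import defaultdict

def term_vector(text):
    """Collect frequency, token positions, and character offsets per term."""
    vector = defaultdict(lambda: {"freq": 0, "positions": [], "offsets": []})
    for position, match in enumerate(re.finditer(r"\w+", text)):
        entry = vector[match.group().lower()]
        entry["freq"] += 1                      # number of occurrences
        entry["positions"].append(position)     # ordinal of the token in the text
        entry["offsets"].append((match.start(), match.end()))  # character span
    return dict(vector)

vector = term_vector("To be or not to be")
print(vector["be"])
```

The "Yes" option corresponds roughly to keeping only the terms and `freq`; the positions and offsets are the extra data that make proximity searches and hit highlighting possible, at the cost of a larger index.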
  • Field types indexing options

    Field     Stored  Analyzed      Vectored
    Id        Yes     Not analyzed  No
    Modified  Yes     Not analyzed  No
    Path      Yes     Analyzed      No
    Content   No      Analyzed      With Positions Offsets

    An example of storing fields related to files on your computer.
  • Analyzers
    • Break apart text into tokens; each token gets indexed separately
    • Remove stop words
    • Decide how to handle punctuation
    • Handle languages and case sensitivity
    • You can create your own by building from scratch or chaining existing analyzers
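The chaining idea in the last bullet can be sketched as a tokenizer followed by a series of token filters. This is a conceptual Python illustration only; the stop-word list, the function names, and `make_analyzer` are assumptions made for the example and do not mirror the Lucene.Net analyzer classes.

```python
import re

# Illustrative stop-word list; real analyzers ship with their own lists.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}

def tokenize(text):
    """Split on runs of letters and digits, which discards punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)

def lowercase(tokens):
    """Normalize case so that 'Hound' and 'hound' become the same token."""
    return [token.lower() for token in tokens]

def remove_stop_words(tokens):
    """Drop very common words that add little to a search."""
    return [token for token in tokens if token not in STOP_WORDS]

def make_analyzer(tokenizer, *filters):
    """Chain a tokenizer with any number of token filters, applied in order."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze

analyze = make_analyzer(tokenize, lowercase, remove_stop_words)
tokens = analyze("The Hound of the Baskervilles!")
print(tokens)
```

The order of the filters matters: lowercasing runs before stop-word removal here so that "The" is recognized as a stop word. The same analyzer should be applied at index time and at query time, otherwise query terms will not line up with indexed terms.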
  • Types of Queries
    • TermQuery
    • PhraseQuery
    • RangeQuery
    • PrefixQuery, WildcardQuery
    • FuzzyQuery
    • Use BooleanQuery to combine them
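One way to picture how a boolean query combines other queries is as set operations over the document ids each sub-query matches. The sketch below is a simplified Python model, not the Lucene.Net BooleanQuery class: the posting sets are made up, scoring is ignored, and when required clauses are present the optional ones are simply skipped (in a real engine they would influence ranking).

```python
# Hypothetical postings: term -> ids of the documents that contain it.
postings = {
    "scarlett": {1, 2, 4},
    "rhett": {2, 3, 4},
    "voldemort": {4},
}

def term_query(term):
    """Return the set of document ids matching a single term."""
    return postings.get(term, set())

def boolean_query(must=(), should=(), must_not=()):
    """Combine result sets: intersect required, union optional, subtract excluded."""
    if must:
        result = set.intersection(*map(set, must))
    else:
        result = set().union(*should)
    for excluded in must_not:
        result = result - excluded
    return result

# scarlett AND rhett, excluding voldemort
required = boolean_query(
    must=[term_query("scarlett"), term_query("rhett")],
    must_not=[term_query("voldemort")],
)
# scarlett OR rhett
either = boolean_query(should=[term_query("scarlett"), term_query("rhett")])
print(sorted(required), sorted(either))
```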
  • Query syntax

    Query Type     Purpose                                          Sample
    TermQuery      Single word query                                scarlett
    PhraseQuery    Matches terms in order                           "frankly my dear"
    RangeQuery     Matches documents between the terms              [1861 TO 1865], {1861 TO 1865}
    WildcardQuery  Lightweight regex-like term matching             Atl*, D?m?
    PrefixQuery    Matches terms that begin with the string         War*
    FuzzyQuery     Closeness matching                               cry~
    BooleanQuery   Combines other queries into complex expressions  Scarlett AND "frankly my dear" -voldemort
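The wildcard, prefix, and range rows can be pictured as different ways of selecting terms from the index's term dictionary. The following Python sketch assumes a small, already lowercased term list and uses `fnmatch` as a stand-in for wildcard matching (note that `fnmatch` also accepts `[...]` character classes, which is broader than the `*` and `?` shown on the slide). Square brackets in the range syntax are modeled as inclusive bounds and curly braces as exclusive bounds.

```python
from fnmatch import fnmatchcase

# Hypothetical term dictionary, kept sorted as a real index would.
terms = sorted([
    "1860", "1861", "1863", "1865",
    "atlanta", "atlas", "dame", "dome", "war", "warfare",
])

def wildcard_terms(pattern):
    """'*' matches any run of characters, '?' matches exactly one."""
    return [t for t in terms if fnmatchcase(t, pattern)]

def prefix_terms(prefix):
    """Terms that begin with the given string."""
    return [t for t in terms if t.startswith(prefix)]

def range_terms(lower, upper, inclusive=True):
    """Terms between two bounds, compared as strings."""
    if inclusive:
        return [t for t in terms if lower <= t <= upper]
    return [t for t in terms if lower < t < upper]

print(wildcard_terms("atl*"))
print(wildcard_terms("d?m?"))
print(prefix_terms("war"))
print(range_terms("1861", "1865"))
print(range_terms("1861", "1865", inclusive=False))
```

Because the comparison is on strings, numeric and date values only sort sensibly when they are stored in a fixed-width, sortable form such as four-digit years.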
  • Query, Filter, and Sort
    • Lucene.Net can handle all three
    • Default sort is by relevance
    • Prefer queries to filters – they perform better
  • Using Dispose()
  • Linq Providers
    • LINQ to Lucene
    • http://linqtolucene.codeplex.com/
    • Lucene.Net.Linq
    • https://github.com/themotleyfool/Lucene.Net.Linq
    • Chris Eldredge
    • MotleyFool
  • Recap
    • Why would I use a search engine?
    • Why would I use Lucene.Net?
    • How would I add Lucene.Net to my project?
      – Web
      – Desktop
    • Where could I go to learn more?
    • When can I buy Dean a beer?
  • REFERENCES: Illuminating Lucene.Net
  • Web References
    • Lucene.Net – http://lucenenet.apache.org
    • Solr – http://lucene.apache.org/solr
    • Wikipedia
      – http://en.wikipedia.org/wiki/Lucene
      – http://en.wikipedia.org/wiki/Search_engine_indexing
    • Academic discussions
      – http://lucene.sourceforge.net/talks/pisa/
      – http://lucene.sourceforge.net/talks/inktomi/
  • Books
    • Lucene in Action, Second Edition
    • Michael McCandless, Erik Hatcher, Otis Gospodnetić
    • Manning Publications
    • July 2010
    • http://www.manning.com/hatcher3/
  • Books
    • Taming Text
    • Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris
    • Manning Publications
    • January 2013
    • http://www.manning.com/ingersoll/
  • Books
    • Introduction to Information Retrieval
    • Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
    • Cambridge University Press
    • 2008
    • http://www-nlp.stanford.edu/IR-book/
  • Presentations
    • http://www.slideshare.net/nitin_stephens/lucene-basics
  • Blogs
    • http://blog.mikemccandless.com/
  • Sample Files
    All the literature shown in the code samples comes from Project Gutenberg.
    http://www.gutenberg.org/